Twitter is a popular microblogging service where users create status messages (called
“tweets”). These tweets sometimes express opinions about different topics and are presented to
the user in chronological order. This format of presentation is useful to the user since the
latest tweets are rich in recent news, which is generally more interesting than tweets about
events that occurred long ago. However, merely presenting tweets in chronological order may
overwhelm the user, especially if he follows many accounts. Therefore, there is a need
to separate the tweets into different categories and then present the categories to the user.
Nowadays, Text Categorization (TC) is becoming more significant, especially for the Arabic
language, which is one of the most complex languages.
In this paper, in order to improve the accuracy of tweet categorization, a system based on
Rough Set Theory is proposed to enrich the document representation. The effectiveness
of our system was evaluated and compared in terms of the F-measure of the Naïve Bayesian
classifier and the Support Vector Machine classifier.
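The F-measure used for this comparison combines precision and recall into one score per category. A minimal sketch of such a per-category evaluation (the gold labels and the two classifiers' predictions below are hypothetical, not the paper's data):

```python
def f_measure(y_true, y_pred, positive):
    """Precision, recall, and F1 for one category."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and predictions from two classifiers
gold     = ["sport", "news", "sport", "news", "sport"]
nb_pred  = ["sport", "news", "news",  "news", "sport"]
svm_pred = ["sport", "sport", "sport", "news", "sport"]

f_nb  = f_measure(gold, nb_pred,  "sport")   # 0.8
f_svm = f_measure(gold, svm_pred, "sport")   # ~0.857
```

In practice the per-category scores are macro- or micro-averaged over all categories before comparing classifiers.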
MULTI-WORD TERM EXTRACTION BASED ON NEW HYBRID APPROACH FOR ARABIC LANGUAGE (csandit)
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are
automatically extracted, they can be used to increase the performance of text mining
applications such as Categorisation, Clustering, Information Retrieval, Machine
Translation, and Summarization. This paper introduces our proposed multiword term
extraction system based on contextual information. In fact, we propose a new method based
on a hybrid approach for Arabic multiword term extraction. Like other hybrid
methods, our method is composed of two main steps: a Linguistic one and a
Statistical one. In the first step, the Linguistic approach uses a Part Of Speech (POS) Tagger
(Taani’s Tagger) and a Sequence Identifier as patterns in order to extract candidate
AMWTs. The second step, which includes our main contribution, is the Statistical approach,
which incorporates contextual information by using a newly proposed association measure based on
Termhood and Unithood for AMWT extraction. To evaluate the efficiency of our proposed
method for AMWT extraction, the latter has been tested and compared using three different
association measures: the proposed one, named NTC-Value, as well as NC-Value and C-Value. The
experimental results, using Arabic texts taken from the environment domain, show that our
hybrid method outperforms the other ones in terms of precision; in addition, it can deal correctly
with tri-gram Arabic multiword terms.
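C-Value, one of the three association measures compared here, is a standard termhood baseline: it rewards frequent, longer candidates and penalises strings that mostly occur nested inside longer candidate terms. A compact sketch of that baseline (the candidate terms and frequencies are invented for illustration; the proposed NTC-Value itself is the paper's contribution and is not reproduced here):

```python
from math import log2

def c_value(term, freq, candidates):
    """C-Value termhood of a multiword candidate.
    `freq` maps every candidate term to its corpus frequency."""
    length = len(term.split())
    # longer candidates that contain this term as a word-level substring
    nests = [t for t in candidates
             if t != term and f" {term} " in f" {t} "]
    if not nests:
        return log2(length) * freq[term]
    nested_freq = sum(freq[t] for t in nests)
    return log2(length) * (freq[term] - nested_freq / len(nests))

# Hypothetical candidate multiword terms with corpus frequencies
freq = {
    "renewable energy": 20,
    "renewable energy source": 8,
    "energy source": 12,
}
scores = {t: c_value(t, freq, freq) for t in freq}
```

NC-Value extends this score with weights from the candidate's surrounding context words, which is the direction the NTC-Value refines further.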
Arabic text categorization algorithm using vector evaluation method (ijcsit)
Text categorization is the process of grouping documents into categories based on their contents. This
process is important to make information retrieval easier, and it has become more important due to the huge
amount of textual information available online. The main problem in text categorization is how to improve the
classification accuracy. Although Arabic text categorization is a new and promising field, there is little
research in this field. This paper proposes a new method for Arabic text categorization using vector
evaluation. The proposed method uses a categorized Arabic document corpus; the weights of the
tested document's words are then calculated to determine the document keywords, which are compared with
the keywords of the corpus categories to determine the tested document's best category.
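The weight-then-match idea can be sketched roughly as follows, with TF-IDF standing in for the word weight and simple keyword overlap standing in for the comparison (the corpus statistics and category keyword sets are invented, and the paper's actual weighting scheme may differ):

```python
from math import log

def tfidf_keywords(doc_tokens, doc_freq, n_docs, k=3):
    """Weight each word by TF-IDF and keep the top-k as document keywords."""
    weights = {}
    for w in set(doc_tokens):
        tf = doc_tokens.count(w) / len(doc_tokens)
        idf = log(n_docs / (1 + doc_freq.get(w, 0)))
        weights[w] = tf * idf
    return set(sorted(weights, key=weights.get, reverse=True)[:k])

def best_category(doc_keywords, category_keywords):
    """Pick the category whose keyword set overlaps most with the document's."""
    return max(category_keywords,
               key=lambda c: len(doc_keywords & category_keywords[c]))

# Hypothetical tested document and corpus statistics
doc = ["match", "match", "goal", "price"]
doc_freq = {"match": 2, "goal": 1, "price": 5}   # documents containing each word
keywords = tfidf_keywords(doc, doc_freq, n_docs=10, k=2)

categories = {"sport": {"match", "team", "goal"},
              "economy": {"market", "price", "bank"}}
category = best_category(keywords, categories)
```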
The effect of training set size in authorship attribution: application on sho... (IJECEIAES)
Authorship attribution (AA) is a subfield of linguistic analysis, aiming to identify the original author among a set of candidate authors. Several research papers have been published, and several methods and models have been developed for many languages. However, the number of related works for Arabic is limited. Moreover, the impact of short word length and training set size is not well addressed; to the best of our knowledge, no published works in this direction, even for other languages, are available. Therefore, we propose to investigate this effect, taking into account different stylometric combinations. The Mahalanobis distance (MD), Linear Regression (LR), and Multilayer Perceptron (MP) are selected as AA classifiers. During the experiment, the training dataset size is increased and the accuracy of the classifiers is recorded. The results are quite interesting and show different classifier behaviours. Combining word-based stylometric features with n-grams provides the best accuracy, reaching 93% on average.
A novel method for Arabic multi word term extraction (ijdms)
Arabic Multiword Terms (AMWTs) are relevant strings of words in text documents. Once they are
automatically extracted, they can be used to increase the performance of Arabic text mining
applications such as Categorization, Clustering, Information Retrieval, Machine Translation, and
Summarization. The proposed methods for AMWT extraction can mainly be categorized into three
approaches: Linguistic-based, Statistics-based, and hybrid-based. These methods present some
drawbacks that limit their use: they can only deal with bi-gram terms, and they do not yield good
accuracy. In this paper, to overcome these drawbacks, we propose a new and efficient method for
AMWT extraction based on a hybrid approach. The latter is composed of two main filtering steps: a
Linguistic filter and a Statistical one. The Linguistic filter uses our proposed Part Of Speech (POS)
Tagger and a Sequence Identifier as patterns in order to extract candidate AMWTs, while the Statistical
filter incorporates contextual information and a newly proposed association measure based on Termhood
and Unithood estimation, named NTC-Value.
To evaluate and illustrate the efficiency of our proposed method for AMWT extraction, a comparative
study has been conducted on the Kalimat Corpus using nine experimental schemes. In the linguistic
filter, we used three POS taggers: Taani’s rule-based method, an HMM-based
statistical method, and our recently proposed hybrid tagger. In the Statistical
filter, we used three statistical measures: C-Value, NC-Value, and our proposed NTC-Value. The
obtained results demonstrate the efficiency of our proposed method for AMWT extraction: it outperforms
the other ones and can deal correctly with tri-gram terms.
Named Entity Recognition using Tweet Segmentation (IRJET Journal)
This document summarizes three research papers on named entity recognition (NER) in tweets. The first paper describes a system called TwiNER that performs NER on targeted Twitter streams to understand user opinions expressed in tweets about organizations. The second paper studies the challenges of NER in tweets due to their terse nature and presents a distantly supervised approach using Labeled LDA that improves NER performance. The third paper proposes modeling user interests in Twitter by extracting named entities from tweets using an unsupervised segmentation approach to avoid large annotation overhead.
An Automatic Question Paper Generation: Using Bloom's Taxonomy (IRJET Journal)
This document presents a system for automatically generating exam questions and categorizing them according to Bloom's Taxonomy. It uses natural language processing techniques like part-of-speech tagging to analyze questions and identify keywords and verbs. Rules are developed to match question patterns and keywords to the appropriate Bloom's Taxonomy category. A randomization algorithm is also introduced to randomly select questions from the database and avoid repetitions. The system aims to help educators automatically analyze exam questions and ensure a balance of cognitive levels according to Bloom's Taxonomy. Preliminary results found the rules could successfully categorize questions in the test set. The proposed system has applications for educational institutions, universities, and government exams.
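The rule-matching step described above can be pictured as keyword lookup: each Bloom level has characteristic verbs, and a question is assigned to the first level whose verbs it contains. A toy sketch (the verb lists are abridged and hypothetical, not the system's actual rules):

```python
# Hypothetical, abridged verb-to-level rules in the spirit of the system
BLOOM_VERBS = {
    "remember":   {"define", "list", "name", "state"},
    "understand": {"explain", "summarize", "describe", "classify"},
    "apply":      {"solve", "demonstrate", "use", "compute"},
    "analyze":    {"compare", "differentiate", "examine"},
    "evaluate":   {"justify", "argue", "assess"},
    "create":     {"design", "construct", "develop"},
}

def bloom_level(question):
    """Match the question's words against each level's keyword verbs."""
    words = {w.strip("?.,").lower() for w in question.split()}
    for level, verbs in BLOOM_VERBS.items():
        if words & verbs:
            return level
    return "unclassified"
```

A real system would rely on POS tagging to pick out the main verb rather than bag-of-words matching, which is exactly why the summarized paper tags questions first.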
A New Concept Extraction Method for Ontology Construction From Arabic Text (CSCJournals)
Ontology is one of the most popular representation models used for knowledge representation, sharing, and reuse. The Arabic language has complex morphological, grammatical, and semantic aspects. Due to the complexity of the Arabic language, automatic Arabic terminology extraction is difficult. In addition, concept extraction from Arabic documents has been a challenging research area because, as opposed to term extraction, concept extraction is more domain-related and more selective. In this paper, we present a new concept extraction method for Arabic ontology construction, which is part of our ontology construction framework. A new method to extract domain-relevant single- and multi-word concepts has been proposed, implemented, and evaluated. Our method combines linguistic information, statistical information, and domain knowledge. It first uses linguistic patterns based on POS tags to extract concept candidates, and then a stop-word filter is applied to remove unwanted strings. To determine the relevance of these candidates within the domain, different statistical measures and a new domain relevance measure are implemented for the first time for the Arabic language. To enhance the performance of concept extraction, domain knowledge is integrated into the module. The concept scores are calculated according to their statistical values and domain knowledge values. In order to evaluate the performance of the method, precision scores were calculated. The results show the high effectiveness of the proposed approach for extracting concepts for Arabic ontology construction.
A Novel Text Classification Method Using Comprehensive Feature Weight (TELKOMNIKA JOURNAL)
Currently, since the categorical distribution of short text corpora is not balanced, it is difficult to
obtain accurate results for short text classification. To solve this problem, this paper proposes
a novel method of short text classification using comprehensive feature weights. This method takes into
account the distribution of the samples in the positive and negative categories, as well as the category
correlation of words, so as to improve the existing feature weight calculation and obtain a new
way of calculating a comprehensive feature weight. The experimental results show that the proposed
method scores significantly higher than other feature-weighting methods in micro- and macro-averaged values,
which shows that this method can greatly improve the accuracy and recall of short text classification.
The document discusses using convolutional neural networks (CNNs) for text classification. It presents two CNN architectures - a character-level CNN that takes raw text as input and a word-level CNN that uses word embeddings. The word-level CNN achieved 85% accuracy on a product categorization task and was faster to train and run than the character-level CNN or traditional SVMs. The document concludes that word-level CNNs are a promising approach for text classification that can achieve high accuracy with minimal tuning.
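The core of a word-level CNN is a filter sliding over windows of word embeddings, followed by max-over-time pooling into a fixed-size sentence vector. A minimal NumPy sketch of just that forward step (random toy embeddings, no training; all dimensions and word ids are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word-level setup: an embedding table and a tokenized sentence
vocab_size, embed_dim, n_filters, width = 50, 8, 4, 3
embeddings = rng.normal(size=(vocab_size, embed_dim))
sentence = [3, 17, 42, 7, 19]                 # hypothetical word ids

x = embeddings[sentence]                      # (seq_len, embed_dim)
filters = rng.normal(size=(n_filters, width, embed_dim))

# 1D convolution over word windows, then max-over-time pooling
conv = np.array([[np.sum(x[i:i + width] * f)
                  for i in range(len(sentence) - width + 1)]
                 for f in filters])           # (n_filters, seq_len - width + 1)
features = conv.max(axis=1)                   # fixed-size sentence vector
```

The pooled `features` vector is what a classification layer would consume; because pooling collapses the time axis, sentences of any length map to the same feature size.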
A Survey of Arabic Text Classification Models (IJECEIAES)
A huge amount of Arabic text is available online, and it requires organization. As a result, there are many natural language processing (NLP) applications concerned with text organization. One of them is text classification (TC), which makes dealing with unorganized text easier by classifying it into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for classifying Arabic texts; Arabic text is complex due to its vocabulary. Arabic is one of the richest languages in the world, with many linguistic bases, yet research in Arabic language processing is scarce compared to English. As a result, these problems represent challenges in the classification and organization of Arabic text. TC helps in accessing documents or information that has already been classified into specific classes or categories. In addition, the classification of documents helps a search engine decrease the number of documents to search, making it easier to match them against queries.
CUSTOMER OPINIONS EVALUATION: A CASE STUDY ON ARABIC TWEETS (ijaia)
This paper presents an automatic method for extracting, processing, and analysing customer opinions on Arabic social media. We present a four-step approach for mining Arabic tweets. First, Natural Language Processing (NLP) with different types of analyses is performed. Second, we present an automatic and expandable lexicon of Arabic adjectives. The initial lexicon is built using 1350 adjectives as seeds, obtained from processing different datasets in the Arabic language. The lexicon is automatically expanded by collecting synonyms and morphemes of each word through Arabic resources and Google Translate. Third, emotional analysis is carried out by two different methods: Machine Learning (ML) and a rule-based method. Finally, Feature Selection (FS) is also considered to enhance the mining results. The experimental results reveal that the proposed method outperforms counterpart methods with an improvement margin of up to 4% in F-Measure.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
Considerable interest has been given to Multiword Expression (MWE) identification and treatment. The identification of MWEs affects the quality of results of different tasks heavily used in natural language processing (NLP), such as parsing and generation. Different
approaches to MWE identification have been applied, such as statistical methods, which are employed as an inexpensive and language-independent way of finding co-occurrence patterns. Another approach relies on linguistic methods for identification, which employ information such as part-of-speech (POS) filters; lexical alignment between languages is also used and
produces more targeted candidate lists. This paper presents a framework for extracting Arabic
MWEs (nominal or verbal) for bi-grams using a hybrid approach. The proposed approach starts by applying a statistical method and then utilizes linguistic rules in order to enhance the results by extracting only patterns that match a relevant language rule. The proposed hybrid
approach outperforms other traditional approaches.
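A statistical pass followed by a linguistic filter, as described above, could be sketched with pointwise mutual information standing in for the co-occurrence score and a small POS-pattern whitelist for the linguistic rule (the counts, tags, patterns, and threshold are all invented; PMI is a generic choice and not necessarily the paper's measure):

```python
from math import log

def pmi(bigram, bigram_freq, word_freq, n_words):
    """Pointwise mutual information: a standard co-occurrence score."""
    w1, w2 = bigram
    p_xy = bigram_freq[bigram] / n_words
    p_x = word_freq[w1] / n_words
    p_y = word_freq[w2] / n_words
    return log(p_xy / (p_x * p_y))

# Hypothetical whitelist of POS patterns for candidate MWEs
ALLOWED_PATTERNS = {("NOUN", "NOUN"), ("NOUN", "ADJ"), ("VERB", "NOUN")}

def extract_mwes(bigram_freq, word_freq, pos, n_words, threshold=1.0):
    """Statistical pass (PMI above threshold), then linguistic POS filter."""
    return [b for b in bigram_freq
            if pmi(b, bigram_freq, word_freq, n_words) >= threshold
            and (pos[b[0]], pos[b[1]]) in ALLOWED_PATTERNS]

# Hypothetical counts from a small corpus
n_words = 1000
word_freq = {"information": 20, "retrieval": 10, "the": 300, "of": 250}
bigram_freq = {("information", "retrieval"): 8, ("the", "of"): 9}
pos = {"information": "NOUN", "retrieval": "NOUN", "the": "DET", "ADP": "ADP", "of": "ADP"}
mwes = extract_mwes(bigram_freq, word_freq, pos, n_words)
```

A frequent function-word pair like ("the", "of") fails both filters: its PMI is negative and its POS pattern is not whitelisted, while a genuine collocation survives.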
The document provides an overview of a proposed text categorization system using a modified Naive Bayes algorithm. It includes sections on the problem definition, objectives, literature review, methodology, and the proposed system architecture, consisting of modules for dataset preprocessing, text categorization using the modified algorithm, and a comparative study. The system would use the 20 Newsgroups dataset, perform text reduction during preprocessing, classify unknown text, and compare the performance of the existing and proposed algorithms. It lists the software and hardware requirements and provides references for related work.
New instances classification framework on Quran ontology applied to question ... (TELKOMNIKA JOURNAL)
Instance classification with a small dataset is a current research problem in Quran ontology development. The existing classification approach used machine learning: a Backpropagation Neural Network. However, this method has a drawback: if the training set is small, classifier accuracy can decline, and the Holy Quran has a small corpus. Based on this problem, our study aims to formulate a new instance classification framework for a small training corpus, applied to a semantic question answering system. As a result, the instance classification framework consists of several essential components: pre-processing, morphology analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function Network algorithm, and the transformation module. This algorithm is chosen since it is robust to noisy data and performs well on small datasets. Furthermore, the document processing module of the question answering system is used to access the instance classification results in the Quran ontology.
Text Categorization Using Improved K Nearest Neighbor Algorithm (IJTET Journal)
Abstract— Text categorization is the process of identifying and assigning the predefined class to which a document belongs. A wide variety of algorithms is currently available to perform text categorization. Among them, the K-Nearest Neighbor text classifier is the most commonly used one. It tests the degree of similarity between a document and the k training documents, thereby determining the category of the test document. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, text is categorized into different classes based on the K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing text and improves the efficiency of the K-Nearest Neighbor algorithm by generating a classification model. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
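The baseline K-Nearest Neighbor text classifier that the paper improves on can be sketched as bag-of-words cosine similarity plus a majority vote among the k most similar training documents (the training snippets below are invented; the one-pass clustering refinement is not reproduced):

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc, training, k=3):
    """Vote among the k training documents most similar to `doc`."""
    vec = Counter(doc.split())
    nearest = sorted(training,
                     key=lambda ex: cosine(vec, Counter(ex[0].split())),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical labelled training documents
training = [
    ("the match ended with a late goal", "sport"),
    ("the team won the match", "sport"),
    ("the bank raised the interest rate", "economy"),
    ("market prices fell at the bank", "economy"),
]
label = knn_classify("the team scored a goal in the match", training)
```

The clustering-based improvement in the paper replaces the raw training documents with cluster representatives, shrinking the set that each test document must be compared against.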
An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user issues queries in one language to retrieve information in another language. The growing requirement on the Internet for users to access information expressed in languages other than their own has led to CLIR becoming established as a major topic in IR.
The document describes a method for detecting hostile content on social media using task adaptive pretraining of transformer models.
Key points:
- The method uses pretrained IndicBERT models fine-tuned on Hindi tweets to generate embeddings for text, hashtags, and emojis, which are concatenated and passed through an MLP classifier.
- An additional stage of "task adaptive pretraining" further pretrains the text encoder on the task data prior to fine-tuning, which improves performance over directly fine-tuning.
- Evaluation on a Hindi hostile tweet detection dataset shows the task adaptive pretraining approach improves F1 scores for hostility detection and related subtasks compared to without additional pretraining.
This document discusses predicting prominent syllables in Malay language sentences using support vector machines (SVM). SVM was trained on 50 sentences with features like part of speech, syllable type, length and position. Radial basis function was used as the kernel. SVM achieved 88.7% accuracy in predicting prominent syllables, outperforming naive Bayes which achieved 88.3% accuracy. The results show that SVM is effective for this task of classifying prominent syllables in Malay language sentences.
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION (IJDKP)
The document presents a new hybrid stemming algorithm for Arabic text categorization. It combines three existing stemming approaches - root-based (Khoja stemmer), stem-based (Larkey stemmer), and statistical (N-gram). The hybrid approach first normalizes text, removes stop words and prefixes/suffixes. It then uses bigram similarity and the Dice measure to find valid roots by comparing word bigrams to a root dictionary. The algorithm is evaluated using naive Bayesian and SVM classifiers in an Arabic text categorization system, and is found to outperform the individual stemming methods. The proposed stemmer improves the performance of Arabic text categorization.
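The bigram-plus-Dice root matching step can be sketched as below, with transliterated stand-ins for Arabic words (the stripped word and the root dictionary are hypothetical):

```python
def char_bigrams(word):
    """Character bigrams of a word, e.g. 'ktab' -> {'kt', 'ta', 'ab'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(word, root):
    """Dice coefficient on character-bigram sets, used to match a
    prefix/suffix-stripped word against candidate roots."""
    a, b = char_bigrams(word), char_bigrams(root)
    if not a or not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def best_root(word, root_dictionary):
    """Pick the dictionary root with the highest bigram similarity."""
    return max(root_dictionary, key=lambda r: dice(word, r))
```

For example, `dice("ktab", "ktb")` shares the bigram "kt" out of five distinct bigrams overall, giving 0.4; the hybrid stemmer accepts the highest-scoring root as the valid one.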
Most text classification problems are associated with multiple class labels, and hence automatic text
classification is one of the most challenging and prominent research areas. Text classification is the
problem of categorizing text documents into different classes. In the multi-label classification scenario,
each document may be associated with more than one label. The real challenge in multi-label
classification is labelling a large number of text documents with a subset of class categories. The
feature extraction and classification of such text documents require an efficient machine learning algorithm
that performs automatic text classification. This paper describes the multi-label classification of product
review documents using a Structured Support Vector Machine.
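A common baseline for the multi-label setting described above is binary relevance: one independent yes/no decision per label. A deliberately crude sketch, with a keyword-profile stand-in for each per-label classifier (the review snippets and threshold are invented; the paper's Structured SVM is not reproduced):

```python
def train_binary_relevance(docs, label_sets):
    """Per label, collect the vocabulary of documents carrying that label."""
    profiles = {}
    for doc, labels in zip(docs, label_sets):
        for lab in labels:
            profiles.setdefault(lab, set()).update(doc.split())
    return profiles

def predict_labels(doc, profiles, threshold=0.6):
    """Assign every label whose vocabulary covers enough of the document."""
    words = set(doc.split())
    return {lab for lab, vocab in profiles.items()
            if len(words & vocab) / len(words) >= threshold}

# Hypothetical product reviews, each with one or more labels
docs = ["battery life is great",
        "camera quality is poor",
        "battery and camera both fine"]
label_sets = [{"battery"}, {"camera"}, {"battery", "camera"}]
profiles = train_binary_relevance(docs, label_sets)
```

Because each label is decided independently, a document can receive zero, one, or several labels, which is exactly what distinguishes multi-label from single-label classification.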
IRJET- Multi Label Document Classification Approach using Machine Learning Te... (IRJET Journal)
This document discusses multi-label document classification approaches using machine learning techniques. It first provides background on multi-label classification and surveys several existing methods. These include transductive multi-label learning via label set propagation, classifier chains for multi-label classification, and multilabel neural networks. It also discusses random k-label sets and graph-based substructure pattern mining techniques. The document then proposes a new multi-label classification system that uses machine learning algorithms with semi-supervised learning and assigns weights to labels during testing to classify instances. Finally, it concludes that existing methods have issues like high computational complexity and missing data that the proposed approach aims to address.
Proposed Method for String Transformation using Probabilistic Approach (Editor IJMTER)
This document discusses a probabilistic approach to string transformation and describes four modules: 1) approximate string search, 2) candidate selection for spelling error correction, 3) candidate generation for spelling error correction, and 4) query reformulation. It presents a new statistical learning approach to string transformation and describes how this approach can be applied to spelling error correction of queries and query reformulation for web search.
Indexing for Large DNA Database Sequences (CSCJournals)
Bioinformatics data contains a huge amount of information due to the large number of sequences, their very long lengths, and the daily new additions, and it needs to be accessed efficiently for many purposes. What makes one DNA data item distinct from another is its sequence: a combination of the four characters A, C, G, and T, of varying length. Using a suitable representation of DNA sequences, together with a suitable index structure holding that representation in main memory, leads to efficient processing, since sequences are accessed through the index and the number of disk I/O accesses is reduced. To avoid false hits, the I/O operations needed at the end are limited by pruning the number of candidate DNA sequences that must be checked, so there is no need to search the whole database. A suitable index is required to search DNA sequences efficiently, with an acceptable index size and search time; the selection of the relation fields on which the index is built has a big effect on both. Our experiments use the n-gram wavelet transformation with single-field and multi-field index structures under a relational DBMS environment. The results show that index size and search time must be weighed carefully when using indexing. Increasing the window size decreases the number of I/O references, and both single-field and multi-field indexing are strongly affected by the window size: a larger window leads to better search time for a particular index type under single-field indexing, while under multi-field indexing the search time is good and nearly the same across most index types. The storage space needed for the RDBMS index types is almost the same as, or greater than, the actual data.
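The pruning idea can be illustrated with a toy in-memory n-gram inverted index (a sketch under assumed names; the paper's wavelet-transform representation and the RDBMS layer are omitted):

```python
def ngrams(seq, n):
    """Sliding-window n-grams of a DNA sequence, e.g. 'ACGT', n=3 -> ['ACG', 'CGT']."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_index(sequences, n):
    """Inverted index: n-gram -> set of ids of sequences containing it."""
    index = {}
    for sid, seq in sequences.items():
        for g in ngrams(seq, n):
            index.setdefault(g, set()).add(sid)
    return index

def candidates(index, query, n):
    """Pruning step: only sequences sharing every query n-gram need a full
    verification pass, so the whole database is never scanned."""
    ids = None
    for g in ngrams(query, n):
        hit = index.get(g, set())
        ids = hit.copy() if ids is None else ids & hit
    return ids or set()
```

Only the surviving candidate ids are then checked character-by-character against the query, which is what keeps false hits from reaching the final answer.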
This document discusses the work of the DCMI-IEEE LTSC Task Force to specify how to express Learning Object Metadata (LOM) instances using terms and constructs from the Dublin Core Metadata Initiative (DCMI) Abstract Model. The Task Force aims to develop a Recommended Practice that defines how LOM elements can be mapped to DCMI Abstract Model classes, properties and encoding schemes to allow LOM metadata to be combined with DCMI metadata in applications. The Task Force has produced initial analyses and examples and plans further work to create an application profile and transformation between LOM and the DCMI Abstract Model representation.
With technological advances and the growth of content advertisement on mobile phones, the use of SMS has increased to such a level that it has prompted unsolicited spam SMS messages to users, and by most reports the volume of SMS spam is growing day by day. These spam messages can also lead to loss of personal data. SMS spam detection is a relatively new area, and the systematic literature on it is insufficient. It can be handled with various machine learning techniques, in the form of SMS spam filtering that separates spam from ham. This paper treats spam detection as a basic two-class document classification problem. The classification combines a classification algorithm with feature extraction over the collected datasets, and uses the learned features to filter messages. We focus on building a Naïve Bayes model for spam message identification, and use Flask, a web-service development micro framework in Python, to build an API for the model. The comparison was performed using different machine learning algorithms and techniques. Kavya P | Dr. A. Rengarajan, "A Comparative Study for SMS Spam Detection", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-1, December 2020, URL: https://www.ijtsrd.com/papers/ijtsrd38094.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/38094/a-comparative-study-for-sms-spam-detection/kavya-p
In this paper we present gender and authorship categorisation using the Prediction by Partial Matching (PPM) compression scheme for Arabic text from Twitter. The PPMD variant of the compression scheme with different orders was used to perform the categorisation. We also applied different machine learning algorithms such as Multinomial Naïve Bayes (MNB), K-Nearest Neighbours (KNN), and an implementation of the Support Vector Machine (LIBSVM), applying the same processing steps for all the algorithms. PPMD shows significantly better accuracy than all the other machine learning algorithms, with order-11 PPMD working best, achieving 90% and 96% accuracy for gender and authorship respectively.
GENDER AND AUTHORSHIP CATEGORISATION OF ARABIC TEXT FROM TWITTER USING PPM (ijcsit)
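The idea behind compression-based categorisation (score a text by how well each class's model compresses it) can be sketched with zlib standing in for PPMD, which is an assumption purely for illustration, since the paper itself uses the PPMD variant rather than deflate:

```python
import zlib

def compressed_size(text):
    """Bytes needed to compress the text at the highest deflate level."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def classify(text, class_corpora):
    """Assign the class whose corpus best 'explains' the text: the one with
    the smallest increase in compressed size when the text is appended."""
    best, best_cost = None, float("inf")
    for label, corpus in class_corpora.items():
        cost = compressed_size(corpus + text) - compressed_size(corpus)
        if cost < best_cost:
            best, best_cost = label, cost
    return best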
The document discusses using convolutional neural networks (CNNs) for text classification. It presents two CNN architectures - a character-level CNN that takes raw text as input and a word-level CNN that uses word embeddings. The word-level CNN achieved 85% accuracy on a product categorization task and was faster to train and run than the character-level CNN or traditional SVMs. The document concludes that word-level CNNs are a promising approach for text classification that can achieve high accuracy with minimal tuning.
A Survey of Arabic Text Classification Models (IJECEIAES)
There is a huge amount of Arabic text available online that requires organization. As a result, there are many applications of natural language processing (NLP) concerned with text organization; one of them is text classification (TC). TC helps in dealing with unorganized text by making it easier to classify documents into suitable classes or labels. This paper is a survey of Arabic text classification. It also presents a comparison among different methods for the classification of Arabic texts, where Arabic text is complex due to its vocabulary. Arabic is one of the richest languages in the world, with many linguistic bases, yet research in Arabic language processing is scarce compared to English. These problems represent challenges in the classification and organization of Arabic text. Text classification helps users access the documents or information that has already been classified into specific classes or categories. In addition, the classification of documents helps a search engine decrease the number of documents to consider, making it easier to search and match against queries.
CUSTOMER OPINIONS EVALUATION: A CASE STUDY ON ARABIC TWEETS (ijaia)
This paper presents an automatic method for extracting, processing, and analysing customer opinions on Arabic social media. We present a four-step approach for mining Arabic tweets. First, Natural Language Processing (NLP) with different types of analyses is performed. Second, we present an automatic and expandable lexicon for Arabic adjectives. The initial lexicon is built using 1350 adjectives as seeds from processing different datasets in the Arabic language. The lexicon is automatically expanded by collecting synonyms and morphemes of each word through Arabic resources and Google Translate. Third, emotional analysis is performed with two different methods: Machine Learning (ML) and a rule-based method. Finally, Feature Selection (FS) is also applied to enhance the mining results. The experimental results reveal that the proposed method outperforms counterpart ones with an improvement margin of up to 4% in F-Measure.
International Journal of Engineering Research and Development (IJERD) (IJERD Editor)
Considerable interest has been given to Multiword Expression (MWE) identification and treatment. The identification of MWEs affects the quality of results for different tasks heavily used in natural language processing (NLP), such as parsing and generation. Different approaches to MWE identification have been applied, such as statistical methods, employed as an inexpensive and language-independent way of finding co-occurrence patterns. Another approach relies on linguistic methods for identification, which employ information such as part-of-speech (POS) filters and lexical alignment between languages, and has produced more targeted candidate lists. This paper presents a framework for extracting Arabic MWEs (nominal or verbal) for bi-grams using a hybrid approach. The proposed approach starts by applying a statistical method and then utilizes linguistic rules to enhance the results by extracting only patterns that match a relevant language rule. The proposed hybrid approach outperforms other traditional approaches.
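The two-step idea (statistical scoring followed by a linguistic POS filter) can be sketched as follows; this is illustrative only, since the tag names and pattern set are assumptions and the paper's actual rules for Arabic are richer:

```python
import math
from collections import Counter

def extract_bigram_mwes(tagged_tokens, patterns, min_count=2):
    """Score adjacent word pairs by pointwise mutual information (PMI),
    keeping only pairs whose POS pattern matches a linguistic rule."""
    words = [w for w, _ in tagged_tokens]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    n = len(words)
    out = []
    for i in range(len(tagged_tokens) - 1):
        (w1, t1), (w2, t2) = tagged_tokens[i], tagged_tokens[i + 1]
        if (t1, t2) not in patterns:      # linguistic filter
            continue
        c = bigrams[(w1, w2)]
        if c < min_count:                 # statistical frequency cutoff
            continue
        pmi = math.log((c * n) / (unigrams[w1] * unigrams[w2]))
        out.append(((w1, w2), pmi))
    return sorted(set(out), key=lambda x: -x[1])
```

Running the statistical step first and then filtering by POS pattern (or the reverse order, as in this sketch's single pass) yields a ranked candidate list containing only pairs that are both frequent and grammatically plausible.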
The document provides an overview of a proposed text categorization system using a modified Naive Bayes algorithm. It includes sections on the problem definition, objectives, literature review, methodology, and a proposed system architecture consisting of modules for dataset preprocessing, text categorization using the modified algorithm, and a comparative study. The system uses the 20 Newsgroups dataset, performs text reduction during preprocessing, classifies unknown text, and compares the performance of the existing and proposed algorithms. It lists the software and hardware requirements and provides references for related work.
New instances classification framework on Quran ontology applied to question ... (TELKOMNIKA JOURNAL)
Instance classification with a small dataset for Quran ontology is a current research problem that appears in Quran ontology development. The existing classification approach used machine learning: a Backpropagation Neural Network. However, this method has a drawback: if the training set is small, classifier accuracy can decline. Unfortunately, the Holy Quran has a small corpus. Based on this problem, our study aims to formulate a new instance classification framework for a small training corpus, applied to a semantic question answering system. As a result, the instance classification framework consists of several essential components: pre-processing, morphology analysis, semantic analysis, feature extraction, instance classification with the Radial Basis Function Network algorithm, and the transformation module. This algorithm is chosen for its robustness to noisy data and its excellent performance on small datasets. Furthermore, the document processing module of the question answering system is used to access the instance classification result in the Quran ontology.
Text Categorization Using Improved K Nearest Neighbor Algorithm (IJTET Journal)
Abstract— Text categorization is the process of identifying and assigning a predefined class to which a document belongs. A wide variety of algorithms are currently available to perform text categorization. Among them, the K-Nearest Neighbor text classifier is the most commonly used. It tests the degree of similarity between a document and the k nearest training documents, thereby determining the category of test documents. In this paper, an improved K-Nearest Neighbor algorithm for text categorization is proposed. In this method, text is categorized into different classes based on the K-Nearest Neighbor algorithm and constrained one-pass clustering, which provides an effective strategy for categorizing text. This improves the efficiency of the K-Nearest Neighbor algorithm by generating a classification model. Text classification using the K-Nearest Neighbor algorithm has a wide variety of text mining applications.
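A minimal version of the basic K-Nearest Neighbor text classifier is shown below, using term-frequency vectors and cosine similarity; the one-pass clustering refinement the paper proposes is not part of this sketch:

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(text, training, k=3):
    """Vote among the k training documents most similar to the text.
    `training` is a list of (document, label) pairs."""
    q = tf_vector(text)
    ranked = sorted(training, key=lambda d: cosine(q, tf_vector(d[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

The improved variant in the paper replaces the raw training set with cluster centroids, so the same similarity-and-vote step runs against far fewer vectors.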
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The papers for publication in The International Journal of Engineering& Science are selected through rigorous peer reviews to ensure originality, timeliness, relevance, and readability.
An information retrieval (IR) system aims to retrieve documents relevant to a user query, where the query is a set of keywords. Cross-language information retrieval (CLIR) is a retrieval process in which the user fires queries in one language to retrieve information in another language. The growing requirement on the Internet for users to access information expressed in languages other than their own has led to CLIR becoming established as a major topic in IR.
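A common CLIR baseline is dictionary-based query translation, sketched below; the bilingual dictionary contents here are invented for illustration:

```python
def translate_query(query, bilingual_dict):
    """Replace each source-language keyword with its target-language
    translations; unknown words pass through untranslated so they can
    still match names and loanwords in the target collection."""
    terms = []
    for word in query.lower().split():
        terms.extend(bilingual_dict.get(word, [word]))
    return terms
```

The translated term list is then submitted to an ordinary monolingual IR engine over the target-language collection; handling translation ambiguity (several candidates per word) is the main refinement real CLIR systems add on top of this.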
The document describes a method for detecting hostile content on social media using task adaptive pretraining of transformer models.
Key points:
- The method uses pretrained IndicBERT models fine-tuned on Hindi tweets to generate embeddings for text, hashtags, and emojis, which are concatenated and passed through an MLP classifier.
- An additional stage of "task adaptive pretraining" further pretrains the text encoder on the task data prior to fine-tuning, which improves performance over directly fine-tuning.
- Evaluation on a Hindi hostile tweet detection dataset shows the task adaptive pretraining approach improves F1 scores for hostility detection and related subtasks compared to without additional pretraining.
Review of Various Text Categorization Methods (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double blind peer reviewed International Journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes publications of high quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high quality technical notes are invited for publications.
The document reviews various text categorization methods and proposes a new supervised term weighting method using normalized term frequency and relevant frequency (ntf.rf). It begins by discussing existing text categorization methods and their limitations. Specifically, existing methods often require labeled training data, cleaned datasets, and work best on linearly separable data. The document then proposes the new ntf.rf method to address these limitations by incorporating preprocessing and leveraging both normalized term frequency and relevant frequency to assign term weights. Finally, the document outlines how ntf.rf could improve text categorization by providing a more effective term weighting approach.
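A supervised weight of this shape can be sketched as follows, assuming the tf.rf form of Lan et al. with the term frequency normalized by the document's maximum term frequency; the exact normalization used by the reviewed method is an assumption here:

```python
import math

def ntf_rf(tf, max_tf, pos_df, neg_df):
    """Sketch of an ntf.rf term weight.

    tf      - the term's frequency in the document
    max_tf  - the highest term frequency in that document (for normalization)
    pos_df  - number of positive-class documents containing the term
    neg_df  - number of negative-class documents containing the term

    rf = log2(2 + pos_df / max(1, neg_df)) grows when the term concentrates
    in the positive class, which is what makes the weighting supervised."""
    ntf = tf / max_tf
    rf = math.log2(2 + pos_df / max(1, neg_df))
    return ntf * rf
```

A term appearing mostly in positive-class documents gets a larger rf factor than one spread evenly across classes, so class-discriminative terms dominate the document vectors.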
An Efficient Classification Model for Unstructured Text Document (SaleihGero)
The document presents a classification model for unstructured text documents that aims to support both generality and efficiency. The model follows the logical sequence of text classification steps and proposes a combination of techniques for each step. Specifically, it uses multinomial Naïve Bayes classification with term frequency-inverse document frequency (TF-IDF) representation. The model is tested on the 20-Newsgroups dataset, and results show improved precision, recall, and F-score compared to other models.
This document presents a study on detecting offensive language on social media using text classification. The authors propose a modular pipeline consisting of data cleaning, tokenization, three embedding methods (TF-IDF, Word2Vec, FastText), and eight classifiers. Experiments on a Twitter dataset show the AdaBoost, SVM and MLP classifiers with TF-IDF embedding achieve the highest average F1-score of 95%. The Naive Bayes classifier generally performs the worst. Hyperparameter tuning further improves results for some models. The study contributes a modular framework for benchmarking text classification approaches for offensive language detection.
A Comparative Study of Centroid-Based and Naïve Bayes Classifiers for Documen... (IJERA Editor)
Assigning documents to related categories is a critical task used for effective document retrieval. Automatic text classification is the process of assigning a new text document to predefined categories based on its content. In this paper, we implemented and compared the Naïve Bayes and Centroid-based algorithms for effective categorization of English-language documents. In the Centroid-based algorithm, we used the Arithmetical Average Centroid (AAC) and Cumuli Geometric Centroid (CGC) methods to calculate the centroid of each class. The experiment is performed on the R-52 dataset of the Reuters-21578 corpus, and the Micro-Averaged F1 measure is used to evaluate classifier performance. Experimental results show that the Micro-Averaged F1 value for NB is the greatest of all, followed by that of CGC, which is greater than that of AAC. All these results are valuable for future research.
Tweet Summarization and Segmentation: A Survey (vivatechijri)
The document summarizes various techniques for tweet summarization and segmentation, and discusses how the particle swarm optimization (PSO) algorithm can be used for tweet clustering. It provides an overview of 8 papers that propose different approaches for tweet summarization using techniques like graph clustering, keyword graphs, and segmentation. It also analyzes these techniques and compares the performance of PSO to other clustering algorithms like K-means. The analysis shows that methods combining K-means and PSO have higher accuracy than K-means alone, and that PSO performs better than K-means for real-time tweet clustering, achieving an accuracy of 90.6% versus 62% for K-means.
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
An in-depth exploration of Bangla blog post classification (journalBEEI)
Bangla blogging is increasing rapidly in the era of information, and consequently blogs have diverse layouts and categorization. In this setting, automated blog post classification is a comparatively more efficient way to organize Bangla blog posts in a standard form, so that users can easily find the articles of interest to them. In this research, nine supervised learning models are utilized and compared for classification of Bangla blog posts: Support Vector Machine (SVM), multinomial naïve Bayes (MNB), multi-layer perceptron (MLP), k-nearest neighbours (k-NN), stochastic gradient descent (SGD), decision tree, perceptron, ridge classifier, and random forest. Moreover, to evaluate performance in predicting blog posts against eight categories, three feature extraction techniques are applied, namely unigram, bigram, and trigram TF-IDF (term frequency-inverse document frequency). The majority of the classifiers show above 80% accuracy, and the other performance evaluation metrics also show good results when comparing the selected classifiers.
A systematic review of text classification research based on deep learning mo...IJECEIAES
Classifying or categorizing texts is the process by which documents are classified into groups by subject, title, author, etc. This paper undertakes a systematic review of the latest research in the field of the classification of Arabic texts. Several machine learning techniques can be used for text classification, but we have focused only on the recent trend of neural network algorithms. In this paper, the concept of classifying texts and classification processes are reviewed. Deep learning techniques in classification and its type are discussed in this paper as well. Neural networks of various types, namely, RNN, CNN, FFNN, and LSTM, are identified as the subject of study. Through systematic study, 12 research papers related to the field of the classification of Arabic texts using neural networks are obtained: for each paper the methodology for each type of neural network and the accuracy ration for each type is determined. The evaluation criteria used in the algorithms of different neural network types and how they play a large role in the highly accurate classification of Arabic texts are discussed. Our results provide some findings regarding how deep learning models can be used to improve text classification research in Arabic language.
Diacritic Oriented Arabic Information Retrieval SystemCSCJournals
Arabic language support in search engines and operating systems is improved in recent years. Searching in the Internet is reliable and can be compared to the excellent support for several other languages, including English. However, for text with diacritics there are some limitations. For this reason, most Information retrieval (IR) systems remove diacritics from text and ignore it for its complexity. Searching text with diacritics is important for some kinds of documents, such as those of religious books, some newspapers and children stories. This research shows the design and development of the system that overcome the problem. The proposed system considers diacritics. The proposed system includes the design complexity in the retrieving algorithm rather than the information repository, which is database in this study. Also, this study analyses the results and the performance. Results are promising and performance analysis shows methods to enhance design and increase the performance. The proposed system can be integrated in search engines, text editors and any information retrieval system that include Arabic text. Performance analysis of the proposed system shows that this system is reliable. The proposed system is applied on database of Hadeeth, which is religious book includes the prophet action and statements. The system can be applied in any kind of data repository.
Car-Following Parameters by Means of Cellular Automata in the Case of EvacuationCSCJournals
This study is attention to the car-following model, an important part in the micro traffic flow. Different from Nagel–Schreckenberg’s studies in which car-following model without agent drivers and diligent ones, agent drivers and diligent ones are proposed in the car-following part in this work and lane-changing is also presented in the model. The impact of agent drivers and diligent ones under certain circumstances such as in the case of evacuation is considered. Based on simulation results, the relations between evacuation time and diligent drivers are obtained by using different amounts of agent drivers; comparison between previous (Nagel–Schreckenberg) and proposed model is also found in order to find the evacuation time. Besides, the effectiveness of reduction the evacuation time is presented for various agent drivers and diligent ones.
Machine learning for text document classification-efficient classification ap...IAESIJAI
Numerous alternative methods for text classification have been created because of the increase in the amount of online text information available. The cosine similarity classifier is the most extensively utilized simple and efficient approach. It improves text classification performance. It is combined with estimated values provided by conventional classifiers such as Multinomial Naive Bayesian (MNB). Consequently, combining the similarity between a test document and a category with the estimated value for the category enhances the performance of the classifier. This approach provides a text document categorization method that is both efficient and effective. In addition, methods for determining the proper relationship between a set of words in a document and its document categorization is also obtained.
A Survey Of Various Machine Learning Techniques For Text ClassificationJoshua Gorinson
This document discusses and compares machine learning techniques for text classification, specifically Naive Bayes, Support Vector Machines (SVM), and Decision Trees. It finds that SVM generally provides higher accuracy than the other techniques. The document provides an overview of each technique and evaluates them on text classification problems. It determines that while Naive Bayes and SVM are both efficient for large datasets, SVM tends to outperform Naive Bayes and is faster to train.
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECHIJCI JOURNAL
This document presents a study that compiled an abbreviation dictionary to normalize abbreviations used in hate speech tweets in English. The study:
1) Used a Python library and keywords from an annotated hate speech dataset to collect over 24,000 tweets containing abbreviations.
2) Manually reviewed the abbreviations from the tweets according to developed rules to determine their complete forms.
3) Compiled a dictionary of 300 abbreviations and their full forms to help normalize abbreviations in Twitter hate speech detection.
The study compared its methodology and results to previous work on abbreviation dictionaries and found it extracted more tweets and abbreviations than similar prior studies. The dictionary will be used to normalize hate speech data in future research.
The document discusses text classification techniques used in search engines. It proposes using a combination of Naive Bayes classification, Random Forest, and Support Vector Machines, along with cross-validation and Schwarz Criterion, to select the best performing model for classifying user search intentions. The goal is to provide relevant organic and sponsored search results by determining the topics users are interested in through their queries. A dataset of 2.7 billion queries will be used to train and test the models.
Text document clustering and similarity detection is the major part of document management, where every document should be identified by its key terms and domain knowledge. Based on the similarity, the documents are grouped into clusters. For document similarity calculation there are several approaches were proposed in the existing system. But the existing system is either term based or pattern based. And those systems suffered from several problems. To make a revolution in this challenging environment, the proposed system presents an innovative model for document similarity by applying back propagation time stamp algorithm. It discovers patterns in text documents as higher level features and creates a network for fast grouping. It also detects the most appropriate patterns based on its weight and BPTT performs the document similarity measures. Using this approach, the document can be categorized easily. In order to perform the above, a new approach is used. This helps to reduce the training process problems. The above framework is named as BPTT. The BPTT has implemented and evaluated using dot net platform with different set of datasets.
Text Segmentation for Online Subjective Examination using Machine LearningIRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine similarity between texts. The goal is to implement natural language processing using text segmentation, which provides benefits. It reviews related work applying various machine learning methods like K-NN, support vector machines, decision trees to tasks like text categorization and clustering.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
HCL Notes und Domino Lizenzkostenreduzierung in der Welt von DLAUpanagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU und die Lizenzen nach dem CCB- und CCX-Modell sind für viele in der HCL-Community seit letztem Jahr ein heißes Thema. Als Notes- oder Domino-Kunde haben Sie vielleicht mit unerwartet hohen Benutzerzahlen und Lizenzgebühren zu kämpfen. Sie fragen sich vielleicht, wie diese neue Art der Lizenzierung funktioniert und welchen Nutzen sie Ihnen bringt. Vor allem wollen Sie sicherlich Ihr Budget einhalten und Kosten sparen, wo immer möglich. Das verstehen wir und wir möchten Ihnen dabei helfen!
Wir erklären Ihnen, wie Sie häufige Konfigurationsprobleme lösen können, die dazu führen können, dass mehr Benutzer gezählt werden als nötig, und wie Sie überflüssige oder ungenutzte Konten identifizieren und entfernen können, um Geld zu sparen. Es gibt auch einige Ansätze, die zu unnötigen Ausgaben führen können, z. B. wenn ein Personendokument anstelle eines Mail-Ins für geteilte Mailboxen verwendet wird. Wir zeigen Ihnen solche Fälle und deren Lösungen. Und natürlich erklären wir Ihnen das neue Lizenzmodell.
Nehmen Sie an diesem Webinar teil, bei dem HCL-Ambassador Marc Thomas und Gastredner Franz Walder Ihnen diese neue Welt näherbringen. Es vermittelt Ihnen die Tools und das Know-how, um den Überblick zu bewahren. Sie werden in der Lage sein, Ihre Kosten durch eine optimierte Domino-Konfiguration zu reduzieren und auch in Zukunft gering zu halten.
Diese Themen werden behandelt
- Reduzierung der Lizenzkosten durch Auffinden und Beheben von Fehlkonfigurationen und überflüssigen Konten
- Wie funktionieren CCB- und CCX-Lizenzen wirklich?
- Verstehen des DLAU-Tools und wie man es am besten nutzt
- Tipps für häufige Problembereiche, wie z. B. Team-Postfächer, Funktions-/Testbenutzer usw.
- Praxisbeispiele und Best Practices zum sofortigen Umsetzen
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
Building Production Ready Search Pipelines with Spark and MilvusZilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data to serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to Milvus vector database for search serving.
Main news related to the CCS TSI 2023 (2023/1695)Jakub Marek
An English 🇬🇧 translation of a presentation to the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on Communications and signalling systems on Railways, which was held in Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 on-line followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Driving Business Innovation: Latest Generative AI Advancements & Success StorySafe Software
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slackshyamraj55
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Freshworks Rethinks NoSQL for Rapid Scaling & Cost-EfficiencyScyllaDB
Freshworks creates AI-boosted business software that helps employees work more efficiently and effectively. Managing data across multiple RDBMS and NoSQL databases was already a challenge at their current scale. To prepare for 10X growth, they knew it was time to rethink their database strategy. Learn how they architected a solution that would simplify scaling while keeping costs under control.
Digital Banking in the Cloud: How Citizens Bank Unlocked Their MainframePrecisely
Inconsistent user experience and siloed data, high costs, and changing customer expectations – Citizens Bank was experiencing these challenges while it was attempting to deliver a superior digital banking experience for its clients. Its core banking applications run on the mainframe and Citizens was using legacy utilities to get the critical mainframe data to feed customer-facing channels, like call centers, web, and mobile. Ultimately, this led to higher operating costs (MIPS), delayed response times, and longer time to market.
Ever-changing customer expectations demand more modern digital experiences, and the bank needed to find a solution that could provide real-time data to its customer channels with low latency and operating costs. Join this session to learn how Citizens is leveraging Precisely to replicate mainframe data to its customer channels and deliver on their “modern digital bank” experiences.
GraphRAG for Life Science to increase LLM accuracyTomaz Bratanic
GraphRAG for life science domain, where you retriever information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Trusted Execution Environment for Decentralized Process MiningLucaBarbaro3
Presentation of the paper "Trusted Execution Environment for Decentralized Process Mining" given during the CAiSE 2024 Conference in Cyprus on June 7, 2024.
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Computer Science & Information Technology (CS & IT)
of predefined categories [6]. Several Text Categorization systems have been developed for English and other European languages, yet very little research has been done for Arabic Text Categorization [7]. Arabic is a highly inflected Semitic language with a very complex morphology compared with English, and it requires a set of pre-processing steps before it can be manipulated. In the process of Text Categorization, the document must pass through a series of steps (Figure 1): transformation of the different types of documents into raw text; removal of the stop words, which are considered irrelevant (prepositions and particles); and finally stemming of all the remaining words. Stemming is the process of extracting the root of a word by removing its affixes [8] [9] [10] [11] [12] [13] [14]. To build the internal representation of each document, the document must pass through the indexing process after pre-processing. Indexing consists of three phases [15]:
a) All the terms appearing in the documents of the corpus are stored in the super vector.
b) Term selection, a kind of dimensionality reduction, aims at selecting a new set of terms from the super vector according to some criteria [16] [17] [18].
c) Term weighting, in which, for each term selected in phase (b) and for every document, a weight is calculated by TF-IDF, which combines the definitions of term frequency and inverse document frequency [19].
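As a rough illustration of phases (a)–(c), the super vector construction and TF-IDF weighting can be sketched as follows (a minimal sketch with illustrative function and variable names, not the exact indexing implementation used in the system):

```python
import math
from collections import Counter

def index_corpus(corpus):
    """Build the super vector and TF-IDF weights for a tokenized corpus.

    corpus: list of documents, each a list of term strings.
    Returns (super_vector, weights), where weights[i][t] is the
    TF-IDF weight of term t in document i.
    """
    n = len(corpus)
    # phase (a): every term of the corpus goes into the super vector
    super_vector = sorted({t for doc in corpus for t in doc})
    # document frequency: number of documents in which each term appears
    df = Counter(t for doc in corpus for t in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        # phase (c): term frequency times inverse document frequency
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return super_vector, weights
```

Term selection (phase b) would then keep only the terms of the super vector whose weights satisfy the chosen criterion.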
Finally, the classifier is built by learning the characteristics of each category from a training set of documents. After the classifier is built, its effectiveness is tested by applying it to the test set and verifying the degree of correspondence between the obtained results and those encoded in the corpus.
Note that one of the major problems in Text Categorization is the document's representation, where we are still limited to only the terms or words that occur in the document. In our work, we believe that the representation of Arabic tweets (which are short text messages) is a challenging and crucial stage. It may impact, positively or negatively, the accuracy of any Tweets Categorization system, and therefore improving the representation step will lead to a great improvement of any Text Categorization system.
Figure 1. Architecture of TC System
To overcome this problem, in this paper we propose a system for Tweets Categorization based on Rough Set Theory (RST) [20] [21], a mathematical tool to deal with vagueness and uncertainty. RST was introduced by Pawlak in the early 1980s [20] and has been integrated into many text mining applications, such as feature selection []; it has been successful in many applications. In this work we propose to use the Upper Approximation of RST to enrich the tweet's representation with other terms of the corpus to which it has semantic links. In this theory, each set in a universe is described by a pair of ordinary sets, called the Lower and Upper Approximations, determined by an equivalence relation on the universe [20].
The remainder of this paper is organized as follows: we begin with a brief review of related work on Arabic Tweets Categorization in the next section. Section III introduces Rough Set Theory and its Tolerance Model; Section IV presents the two machine learning algorithms for Text Categorization (TC) used in our system, the Naïve Bayesian and Support Vector Machine classifiers; Section V describes our proposed system for Arabic Tweets Categorization; Section VI reports the experimental results; finally, Section VII concludes this paper and presents future work and some perspectives.
2. RELATED WORK
A number of recent papers have addressed the categorization of tweets; most of them were tested against English text [4] [30] [31]. Categorization systems that address Arabic tweets, however, are very rare in the literature [1]. This latter work, realized by Rehab Nasser et al., presents a roadmap for understanding Arabic tweets through two main objectives. The first is to predict tweet popularity in the Arab world. The second is to analyze the use of Arabic proverbs in tweets; the Arabic proverbs classification model was labeled "Category" with four class values: sport, religious, political, and ideational.
On the other hand, a wide range of Text Categorization systems based on Rough Set Theory have been developed; most of them were tested against English text [39] [40]. Text Categorization systems based on Rough Sets that address Arabic text are rare in the literature [41]. For Arabic Text Categorization, Sawaf [32] used statistical methods such as maximum entropy to cluster Arabic news articles; the results derived by these methods were promising even without morphological analysis. In [33], NB was applied to classify Arabic web data; the results showed that the average accuracy was 68.78%. The work of Duwairi [34] describes a distance-based classifier for Arabic text categorization. In [35], Laila et al. compared the Manhattan distance and Dice measures using an N-gram frequency statistical technique against Arabic data sets collected from several online Arabic newspaper websites. The results showed that N-grams using the Dice measure outperformed the Manhattan distance.
Mesleh et al. [36] used three classification algorithms, namely SVM, KNN and NB, to classify 1445 texts taken from online Arabic newspaper archives. The compiled texts were classified into nine classes: Computer, Economics, Education, Engineering, Law, Medicine, Politics, Religion and Sports. Chi Square statistics was used for feature selection. [36] reported that, "Compared to other classification methods, their system shows a high classification effectiveness for Arabic data set in terms of F-measure (F=88.11)".
Thabtah et al. [37] investigated the NB algorithm with the Chi Square feature selection method. The experimental results over different Arabic text categorization data sets provided evidence that feature selection often increases classification accuracy by removing rare terms. In [38], NB and KNN were applied to classify Arabic text collected from online Arabic newspapers.
The results show that the NB classifier outperformed KNN based on the Cosine coefficient with regard to macro F1, macro recall and macro precision measures.
Recently, Hadni et al. [7] presented an effective Arabic stemmer based on a hybrid approach for Arabic Text Categorization.
Note that, in any Text Categorization system, the central point is the document and its representation, which may impact positively or negatively the accuracy of the system.
In the following section we present Rough Set Theory, its mathematical background, and the Tolerance Rough Set Model, which is proposed to deal with text representation.
3. ROUGH SET THEORY
3.1. Rough Set Theory
In this section we present Rough Set Theory, which was originally developed as a tool for data analysis and classification [20] [21]. It has been successfully applied to various tasks, such as feature selection/extraction, rule synthesis and classification. The central point of Rough Set Theory is the notion of set approximation: any set in U (a non-empty set of objects called the universe) can be approximated by its lower and upper approximations. In order to define the lower and upper approximations we need to introduce an indiscernibility relation, which can be any equivalence relation R (reflexive, symmetric, transitive). For two objects x, y ∈ U, if xRy then we say that x and y are indiscernible from each other. The indiscernibility relation R induces a complete partition of the universe U into equivalence classes [x]_R, x ∈ U [22].
We define the lower and upper approximations of a set X, with regard to an approximation space A = (U, R), respectively as:

L_R(X) = {x ∈ U : [x]_R ⊆ X} (1)
U_R(X) = {x ∈ U : [x]_R ∩ X ≠ ∅} (2)

Approximations can also be defined by means of a rough membership function. Given the rough membership function μ_X : U → [0, 1] of a set X ⊆ U, the rough approximations are defined as:

L_R(X) = {x ∈ U : μ_X(x) = 1} (3)
U_R(X) = {x ∈ U : μ_X(x) > 0} (4)

where the rough membership function is given by:

μ_X(x) = |[x]_R ∩ X| / |[x]_R| (5)
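Equations (1) and (2) can be illustrated with a short sketch (an illustrative toy, not part of the proposed system): given the partition induced by R, the approximations are unions of equivalence classes.

```python
def approximations(partition, X):
    """Lower and upper approximations of X (eqs. 1 and 2).

    partition: the equivalence classes [x]_R of the universe U,
               given as a list of sets.
    X:         the set to approximate.
    """
    X = set(X)
    lower, upper = set(), set()
    for cls in partition:
        if cls <= X:   # [x]_R ⊆ X      -> contributes to L_R(X), eq. (1)
            lower |= cls
        if cls & X:    # [x]_R ∩ X ≠ ∅  -> contributes to U_R(X), eq. (2)
            upper |= cls
    return lower, upper
```

For the partition {{1, 2}, {3, 4}, {5}} and X = {1, 2, 3}, the lower approximation is {1, 2} and the upper approximation is {1, 2, 3, 4}, so X is indeed "rough": it cannot be expressed exactly as a union of equivalence classes.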
Rough Set Theory can be applied to any data type, but for document representation we use its Tolerance Model, described in the next section.
3.2. Tolerance Rough Set Model
Let D = {d1, d2, …, dn} be a set of documents and T = {t1, t2, …, tm} the set of index terms of D. With the adoption of the vector space model, each document di is represented by a weight vector {wi1, wi2, …, wim}, where wij denotes the weight of index term tj in document di. The tolerance space is defined over the universe of all index terms U = T = {t1, t2, …, tm} [23].
Let fdi(ti) denotes the number of index terms ti in document di; fD(ti, tj) denotes the number of
documents in D in which both index terms ti an tj occurs. The uncertainty function I with regards
to threshold is defined as:
I = {tj | fD(ti, tj) } U {ti} (6)
Clearly, the above function satisfies conditions of being reflexive and symmetric. So I(Ii) is the
tolerance class of index term ti. Thus we can define the membership function μ for Ii T, X ⊆ T
as [24]:
μX(ti, X) = v(I(ti), X) =
|
7. |
(7)
Finally, the lower and upper approximations of any document di ⊆ T can be determined as:

LR(di) = {ti ∈ T : ν(Iθ(ti), di) = 1} (8)
UR(di) = {ti ∈ T : ν(Iθ(ti), di) > 0} (9)
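The tolerance classes of formula (6) and the upper approximation of formula (9) can be sketched directly from document co-occurrence counts. The toy documents, the threshold value and the function names below are illustrative, not the paper's corpus.

```python
from itertools import combinations
from collections import Counter

def tolerance_classes(docs, theta):
    """I_theta(t) = {u : f_D(t, u) >= theta} ∪ {t}, following formula (6)."""
    co = Counter()
    for d in docs:
        for t, u in combinations(sorted(set(d)), 2):
            co[(t, u)] += 1                    # document-level co-occurrence count
    I = {t: {t} for d in docs for t in d}      # every term belongs to its own class
    for (t, u), f in co.items():
        if f >= theta:                         # the pair co-occurs often enough
            I[t].add(u)
            I[u].add(t)
    return I

def upper_approximation(doc, I):
    """U_R(d) = {t : I_theta(t) ∩ d is non-empty}, following formula (9)."""
    d = set(doc)
    return {t for t in I if I[t] & d}

docs = [["cinema", "film", "show"], ["cinema", "show"], ["news", "economy"]]
I = tolerance_classes(docs, theta=2)
print(sorted(upper_approximation(["show"], I)))   # ['cinema', 'show']
```

With θ = 2, only "cinema" and "show" co-occur in at least two documents, so the one-word document ["show"] is enriched with the semantically linked term "cinema", which is exactly the enrichment effect the proposed system exploits.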
Once the document handling is finished, the results become the input of any Text Categorization
system. In the following section we present two of the most popular Machine Learning
algorithms: the Naïve Bayesian and the Support Vector Machine.
4. MACHINE LEARNING BASED CATEGORIZATION
TC is the task of automatically sorting a set of documents into categories from a predefined set.
This section covers two of the best-known Machine Learning algorithms used for TC:
Naïve Bayesian (NB) and Support Vector Machine (SVM).
4.1. Naïve Bayesian Classifier
The NB is a simple probabilistic classifier based on applying Bayes' theorem; it is a powerful,
easy and language-independent method [25].
When the NB classifier is applied to the TC problem we use equation (10):

p(class | document) = p(class) p(document | class) / p(document) (10)

where:
P(class | document): the probability that a given document D belongs to a given class C.
P(document): the probability of a document; it is a constant that can be ignored.
P(class): the probability of a class, calculated as the number of documents in the category
divided by the number of documents in all categories.
P(document | class): the probability of the document given the class; since a document can be
represented by a set of words:

p(document | class) = ∏i p(wordi | class)

p(wordi | class): the probability that a given word occurs in all documents of class C, which can
be computed as follows:

p(wordi | class) = (Tct + λ) / (Nc + λV) (13)

where:
Tct: the number of times the word occurs in category C.
Nc: the number of words in category C.
V: the size of the vocabulary table.
λ: a positive constant, usually 1 or 0.5, to avoid zero probability.
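The NB equations above can be sketched as a small multinomial classifier with Laplace smoothing (λ = 1, implementing equation (13)). The toy training corpus and the function names are illustrative, not the paper's data.

```python
import math
from collections import Counter

def train_nb(labeled_docs, lam=1.0):
    """labeled_docs: list of (word_list, class_label) pairs."""
    class_docs = Counter(c for _, c in labeled_docs)
    word_counts = {c: Counter() for c in class_docs}
    for words, c in labeled_docs:
        word_counts[c].update(words)
    vocab = {w for words, _ in labeled_docs for w in words}
    priors = {c: n / len(labeled_docs) for c, n in class_docs.items()}  # p(class)
    return priors, word_counts, vocab, lam

def classify(doc, model):
    priors, word_counts, vocab, lam = model
    best, best_lp = None, float("-inf")
    for c, prior in priors.items():
        nc = sum(word_counts[c].values())      # N_c: words in category c
        lp = math.log(prior)                   # log p(class)
        for w in doc:
            # p(word|class) = (T_ct + lambda) / (N_c + lambda * V), eq. (13)
            lp += math.log((word_counts[c][w] + lam) / (nc + lam * len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

train = [(["gold", "market"], "economy"), (["film", "actor"], "cinema"),
         (["market", "oil"], "economy")]
model = train_nb(train)
print(classify(["market", "gold"], model))   # economy
```

Working in log space avoids numerical underflow from multiplying many small word probabilities, and the smoothing constant keeps unseen words from zeroing out an entire class.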
4.2. Support Vector Machine Classifier
SVM was introduced by Vapnik [26] and was first applied to TC by Joachims [27]. Based on the
structural risk minimization principle from computational learning theory, SVM seeks a decision
surface that separates the training data points into two classes and makes decisions based on the
support vectors, which are selected as the only effective elements in the training set.
Given a set of N linearly separable points S = {xi ∈ Rn | i = 1, 2, …, N}, each point xi belongs to
one of two classes, labeled yi ∈ {-1, +1}. A separating hyperplane divides S into two sides, each
side containing points with the same class label only. The separating hyperplane can be identified
by the pair (w, b) that satisfies w.x + b = 0 and:

w.xi + b ≥ +1 if yi = +1 (14)
w.xi + b ≤ -1 if yi = -1

for i = 1, 2, …, N, where the dot product operation (.) is defined by w.x = Σi wi xi for vectors w
and x. The goal of SVM learning is thus to find the Optimal Separating Hyperplane (OSH) that
has the maximal margin to both sides. This can be formulated as:

minimize (1/2) w.w
subject to:
w.xi + b ≥ +1 if yi = +1, for i = 1, 2, …, N (15)
w.xi + b ≤ -1 if yi = -1
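As an illustration of the optimization in (15), the sketch below trains a linear separator by batch subgradient descent on the regularized hinge loss, one common approximate solver for this problem (not necessarily the solver used in [26] or [27]); the data points and hyperparameters are made up.

```python
def train_linear_svm(points, labels, lam=0.01, eta=0.1, epochs=500):
    """Minimize (lam/2)*w.w + (1/N) * sum of hinge losses max(0, 1 - y*(w.x + b));
    with small lam this approximates the hard-margin problem (15)."""
    dim, n = len(points[0]), len(points)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]            # gradient of the regularizer
        gb = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:                     # hinge-loss subgradient term
                gw = [gj - y * xj / n for gj, xj in zip(gw, x)]
                gb -= y / n
        w = [wj - eta * gj for wj, gj in zip(w, gw)]
        b -= eta * gb
    return w, b

def predict(w, b, x):
    """Report which side of the hyperplane w.x + b = 0 the point falls on."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

X = [[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]]  # linearly separable
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, xi) for xi in X])        # [1, 1, -1, -1]
```

Only the training points with margin below 1 contribute to the update, which mirrors the role of the support vectors described above: points far from the decision surface have no effect on it.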
Figure 2 shows the optimal separating hyperplane.

Figure 2. Learning Support Vector Classifier

The small crosses and circles represent positive and negative training examples, respectively,
whereas lines represent decision surfaces. Decision surface σi (indicated by the thicker line) is,
among those shown, the best possible one, as it is the middle element of the widest set of parallel
decision surfaces (i.e., its minimum distance to any training example is maximal). Small boxes
indicate the support vectors.

During classification, SVM makes its decision based on the OSH instead of the whole training
set: it simply finds out on which side of the OSH the test pattern is located. This property makes
SVM highly competitive, compared with other traditional pattern recognition methods, in terms
of computational efficiency and predictive accuracy [28].

The method described is also applicable to the case in which the positives and the negatives are
not linearly separable. Yang and Liu [28] experimentally compared the linear case (namely, when
the assumption is made that the categories are linearly separable) with the nonlinear case on a
standard benchmark, and obtained slightly better results in the former case.

5. THE PROPOSED SYSTEM FOR ARABIC TWEETS CATEGORIZATION

In this section we present in detail our proposed system for Arabic Tweets Categorization. The
proposed system contains two main components: the first component generates the Upper
Approximation for each Tweet to extend the Tweet's representation, taking into account not just
its own words but also the words with which they have semantic links (Figure 3); the second one
performs the categorization using the Naïve Bayesian and Support Vector Machine classifiers.

The treatment goes through the following steps: each Tweet in the corpus is first cleaned by
removing Arabic stop words, Latin words and special characters like (/, #, $, etc.). After that, for
each word in the corpus we perform the following operations:

• Apply a stemming algorithm to generate the root and eliminate redundancy. Many algorithms
have been proposed for this task, such as Khoja [11] and the Light Stemmer [13], which are the
best-known algorithms for Arabic text preprocessing; we used the Khoja stemmer at this stage.
• Calculate the word's frequency in the document and also in the whole corpus.
• Determine the tolerance class of the term, which contains all the words that occur with it in the
same document a number of times greater than the threshold. This tolerance class is defined
using formula (6).
After these operations we can calculate the approximations for each document, using formula (8)
for the lower and formula (9) for the upper approximation. Term weighting is then applied, in
which a weight is calculated for each term by TF-IDF, which combines the definitions of term
frequency and inverse document frequency.
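The term-weighting step above can be sketched with the classic TF-IDF formula; the toy documents and the function name are illustrative, and tokenization and stemming are assumed already done.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight term t in document d as tf(t, d) * log(N / df(t))."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    weights = []
    for d in docs:
        tf = Counter(d)                 # raw term frequency within the document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [["gold", "market", "market"], ["market", "oil"], ["film", "actor"]]
w = tf_idf(docs)
# "market" occurs twice in doc 0 and appears in 2 of the 3 documents: 2 * ln(3/2)
print(round(w[0]["market"], 3))   # 0.811
```

Terms appearing in every document get a weight of zero (log 1 = 0), so the weighting automatically discounts words that carry no discriminative information for categorization.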
Figure 3. Example of Rough Set Application

To illustrate the semantic links discovered by the Upper Approximation, Figure 3 presents an
example of an Arabic Tweet after the pre-processing; the initial tweet contains the word "show",
and we see that the word "cinema" was added to the generated Upper Approximation because the
two words are semantically related and often occur with each other.
Finally, the classifier is built by learning the characteristics of each category from a training set of
Tweets. After building the classifier, its effectiveness is tested by applying it to the test set and
verifying the degree of correspondence between the obtained results and the labels encoded in the
corpus.
6. EXPERIMENTS RESULTS
To illustrate that our proposed method can improve the Tweet’s representation by using the
Upper Approximation of RST and therefore can enhance the performance of our Arabic Tweets
Categorization System. In this section a series of experiments has been conducted.
Figure 4. Description of Our Experiments

Figure 4 describes our experiments with and without applying the Upper Approximation based
on Rough Set Theory.
The data set used in our experiments was collected from Twitter by using the NodeXL Excel
Template, which is a free Excel template that makes it very easy to collect Twitter network data
[29]. This corpus was manually classified into six categories (Table 1): Cinema, News,
Documentary, Health, Tourism and Economy. Table 2 contains some examples of Tweets
collected from Twitter.
The dataset has been divided into two parts: training and testing. The training data consist of 70%
of the documents per category; the testing data, on the other hand, consist of 30% of the
documents of each category.
Table 1. The 6 categories and the number of Tweets in each one

Category        Number of Tweets
Cinema          79
Documentary     64
Economy         63
Health          71
News            90
Tourism         83
Total           450
To assess the performance of the proposed system, a series of experiments has been conducted.
The effectiveness of our system has been evaluated and compared in terms of the F1-measure
using the NB and the SVM classifiers in our TC system.
The F1-measure can be calculated using the Recall and Precision measures as follows:

F1-measure = (2 × Recall × Precision) / (Recall + Precision) (16)

Precision and Recall are defined as follows:

Precision = TP / (TP + FP) (17)
Recall = TP / (TP + FN) (18)
where:
• True Positive (TP) refers to the set of documents which are correctly assigned to the given
category.
• False Positive (FP) refers to the set of documents which are incorrectly assigned to the
category.
• False Negative (FN) refers to the set of documents which are incorrectly not assigned to the
category.
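The evaluation measures above can be computed directly from the per-category counts; the sketch below implements equations (16)-(18), and the example counts are made up for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP), equation (17)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Recall = TP / (TP + FN), equation (18)."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """F1 = 2 * Recall * Precision / (Recall + Precision), equation (16)."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

# Hypothetical counts for one category: 80 correct assignments,
# 20 wrong assignments, 10 missed documents.
print(round(f1(tp=80, fp=20, fn=10), 4))   # 0.8421
```

Because F1 is the harmonic mean of precision and recall, it penalizes a classifier that scores well on one measure at the expense of the other, which is why the experiments report it rather than either measure alone.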
Table 2. Examples of the Tweets in the corpus

Category       Tweet (English translation)
Cinema         Drowning in a world beset by madness, murder and supernatural powers
Cinema         The real story of the assassination plot against President Richard Nixon
Documentary    6 days left before the global program "The Big Universe" is shown on the National Geographic Abu Dhabi channel
Economy        The rise of gold to its highest level in three and a half weeks after Washington's approval of direct flights to Iraq
News           A mortar shell fell, leading to serious injuries to the governor of Anbar
Tourism        The resort of Chamonix is one of the most famous resorts in France

Figure 5. Representation of the Precision's Average
Figure 6. Representation of the Recall's Average
Figure 7. Representation of the F1-measure's Average
Figure 7 shows the obtained F1-measure results with and without using the Rough Set Theory in
our Arabic Tweets Categorization system. These results illustrate that using the Rough Set
Theory greatly enhances the performance of Arabic Tweets Categorization.
Using the SVM classifier, applying the RST produced better results for the five classes Cinema
(86.4), Documentary (91.9), Economy (90.98), News (87.52) and Tourism (88.51), compared to
the categorization without the RST: Cinema (85.6), Documentary (84.4), Economy (85.35),
News (85.45) and Tourism (85.54). On average, the F1-measure was 87.13% with the RST
compared to 84.39% without it.
To validate our results, we used another classifier, the NB, and the results were 85.49% with the
RST compared to 84% without it.
The precision and recall results, shown in Figure 5 and Figure 6 respectively, also illustrate that
using the RST positively influences the process of Arabic Tweets Categorization by enriching the
Tweet representation with other terms that do not occur in the Tweet but have semantic links with
the Tweet's terms.
7. CONCLUSION AND FUTURE WORK
Tweets Categorization becomes an interest topic in recent years especially for the Arabic
Language. Tweets Representation plays a vital role and may impact positively or negatively on
the performance of any Tweets Categorization System. In this paper, we have proposed an
effective method for Tweets Representation based on Rough Set Theory. This latter enriches and
adds other terms which are semantically related with the original terms existing in the original
Tweets. The proposed method has been integrated and tested for Arabic Tweets Categorization
using NB and SVM classifiers.
The obtained results show that using the Upper Approximation of the Rough Set Theory
significantly increases the F1-measure of the Tweets Categorization system.
In our future work, we will focus on using an external resource like Arabic WordNet or Arabic
Wikipedia to add more semantic links between terms in Tweets Representation step.
REFERENCES
[1] Rehab Nasser Al-Wehaibi, Muhammad Badruddin Khan, "Understanding the Content of Arabic
Tweets by Data and Text Mining Techniques”, Symposium on Data Mining and Applications
(SDMA2014)
[2] K. Lee, D. Palsetia, R. Narayanan, Md Patwary, A. Agrawal, A. Choudhary. “Twitter Trending Topic
Classification”, 11th IEEE International Conference on Data Mining Workshops 2011, 978-0-7695-
4409-0/11
[3] A. Go, R. Bhayani, L. Huang. “Twitter Sentiment Classification using Distant Supervision”,
Processing (2009), S. 1—6
[4] Bharath Sriram. “Short Text Classification In Twitter To Improve Information Filtering”. Computer
Science and Engineering. Ohio State University, 2010
[5] B. Sriram, D. Fuhry, E. Demir, H. Ferhatosmanoglu. “Short Text Classification in Twitter to Improve
Information Filtering”, SIGIR’10, July 19–23, 2010, Geneva, Switzerland. ACM 978-1 60558-896-
4/10/07.
[6] Sebastiani F. “Machine learning in automated text categorization”. ACM Computing Surveys, volume
34 number 1. PP 1-47. 2002.
[7] M.Hadni, A.Lachkar, S. Alaoui Ouatik “Effective Arabic Stemmer Based Hybrid Approach for Arabic
Text Categorization”, International Journal of Data Mining Knowledge Management Process
(IJDKP) Vol.3, No.4, July 2013
[8] Al-Fedaghi S. and F. Al-Anzi. “A new algorithm to generate Arabic root-pattern forms”. In
proceedings of the 11th national Computer Conference and Exhibition. PP 391-400. March 1989.
[9] Al-Shalabi R. and M. Evens. “A computational morphology system for Arabic”. In Workshop on
Computational Approaches to Semitic Languages, COLING-ACL98. August 1998.
[10] Aljlayl M. and O. Frieder. “On Arabic search: improving the retrieval effectiveness via a light
stemming approach”. Proceedings of ACM CIKM 2002 International Conference on Information and
Knowledge Management, McLean, VA, USA, 2002, pp. 340-347.
[11] Chen A. and F. Gey. “Building an Arabic Stemmer for Information Retrieval”. In Proceedings of the
11th Text Retrieval Conference (TREC 2002), National Institute of Standards and Technology. 2002.
[12] Khoja S.. “Stemming Arabic Text”. Lancaster, U.K., Computing Department, Lancaster University.
1999.
[13] Larkey L. and M. E. Connell. “Arabic information retrieval at UMass in TREC-10”. Proceedings of
TREC 2001, Gaithersburg: NIST. 2001.
[14] Larkey L., L. Ballesteros, and M. E. Connell. “Improving Stemming for Arabic Information Retrieval:
Light Stemming and Co occurrence Analysis”. Proceedings of SIGIR’02. PP 275–282. 2002.
[15] Sebastiani F. "A Tutorial on Automated Text Categorisation". Proceedings of ASAI-99, 1st
Argentinian Symposium on Artificial Intelligence. PP 7-35. 1999.
[16] Liu T., S. Liu, Z. Chen and Wei-Ying Ma. “An Evaluation on Feature Selection for Text Clustering”.
Proceedings of the 12th International Conference (ICML 2003), Washington, DC, USA. PP 488-495.
2003.
[17] Rogati M. and Y. Yang. “High-Performing Feature Selection for Text classification”. CIKM'02, ACM.
2002.
[18] Yang Y., and J. O. Pedersen. “A comparative study on feature selection in text categorization”.
Proceedings of ICML-97. PP 412-420. 1997.
[19] Aas K. and L. Eikvil. “Text categorisation: A survey”, Technical report, Norwegian Computing
Center. 1999.
[20] Pawlak, Z. Rough sets: Theoretical aspects of reasoning about data. Kluwer Dordrecht, 1991.
[21] Jan Komorowski, Lech Polkowski, Andrzej Skowron “Rough Sets: A Tutorial”
[22] Ngo Chi Lang “A tolerance rough set approach to clustering web search results”
[23] Jin Zhang and Shuxuan Chen “A study on clustering algorithm of Web search results based on rough
set”, Software Engineering and Service Science (ICSESS), 2013
[24] Ngo Chi Lang, “A tolerance rough set approach to clustering web search results”, Poland: Warsaw
University, 2003.
[25] Saleh Alsaleem, “Automated Arabic Text Categorization Using SVM and NB”, International Arab
Journal of e-Technology, Vol. 2, No.2, June 2011.
[26] Vapnik V. (1995). The Nature of Statistical Learning Theory, chapter 5. Springer-Verlag, New York.
[27] Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant
Features, In Proceedings of the European Conference on Machine Learning (ECML), 1998, pp. 137-
142, Berlin.
[28] Yang Y. and X. Liu, “A re-examination of text categorization methods,” in 22nd Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’99), pp. 42–
49, 1999.
[29] http://social-dynamics.org/twitter-network-data/
[30] B. Sriam, D. Fuhry, E. Demir, H. Ferhatosmanoglu, and M. Demirbas, "Short text classification in
twitter to improve information filtering,” in Proceeding of the 33rd international ACM SIGIR
conference on Research and development in information retrieval, 2010, pp. 841–842.
[31] D. Antenucci, G. Handy, A. Modi, M. Tinkerhess “Classification Of Tweets Via Clustering Of
Hashtags”, EECS 545 FINAL PROJECT, FALL, 2011
[32] Sawaf, H. Zaplo,J. and Ney. H. Statistical Classification Methods for Arabic News Articles. Arabic
Natural Language Processing, Workshop on the ACL,2001. Toulouse, France.
[33] El-Kourdi, M., Bensaid, A., and Rachidi, T.Automatic Arabic Document Categorization Based on the
Naïve Bayes Algorithm, 20th International Conference on Computational Linguistics, 2004, Geneva
[34] Duwairi R., “Machine Learning for Arabic Text Categorization,” Journal of the American Society for
Information Science and Technology (JASIST), vol. 57, no. 8, pp. 1005-1010, 2005.
[35] Laila K. “Arabic Text Classification Using N-Gram Frequency Statistics A Comparative Study,”
DMIN, 2006, pp. 78-82.
[36] Mesleh, A. A. Chi Square Feature Extraction Based Svms Arabic Language Text Categorization
System, Journal of Computer Science (3:6), 2007, pp. 430-435.
[37] Thabtah F., Eljinini M., Zamzeer M., Hadi W. (2009) Naïve Bayesian based on Chi Square to
Categorize Arabic Data. In proceedings of The 11th International Business Information Management
Association Conference (IBIMA) Conference on Innovation and Knowledge Management in Twin
Track Economies, Cairo, Egypt 4 - 6 January. (pp. 930-935).
[38] Hadi W., Thabtah F., ALHawari S., Ababneh J. (2008b) Naive Bayesian and K-Nearest Neighbour to
Categorize Arabic Text Data. Proceedings of the European Simulation and Modelling Conference. Le
Havre, France, (pp. 196-200), 2008
[39] Y. Li, S.C.K. Shiu, S.K. Pal, J.N.K. Liu, “A rough set-based case-based reasoned for text
categorization”, International Journal of Approximate Reasoning 41 (2006) 229–255, 2005
[40] W. Zhao, Z. Zhang, “An Email Classification Model Based on Rough Set Theory”, Active Media
Technology 2005, IEEE
[41] Yahia, M.E. “Arabic text categorization based on rough set classification”, Computer Systems and
Applications (AICCSA), 2011 IEEE