This document presents a study that compiled an abbreviation dictionary to normalize abbreviations used in hate speech tweets in English. The study:
1) Used a Python library and keywords from an annotated hate speech dataset to collect over 24,000 tweets containing abbreviations.
2) Manually reviewed the abbreviations from the tweets according to developed rules to determine their complete forms.
3) Compiled a dictionary of 300 abbreviations and their full forms to help normalize abbreviations in Twitter hate speech detection.
The study compared its methodology and results to previous work on abbreviation dictionaries and found it extracted more tweets and abbreviations than similar prior studies. The dictionary will be used to normalize hate speech data in future research.
International Journal on Cybernetics & Informatics (IJCI), Vol. 12, No. 2, April 2023
David C. Wyld et al. (Eds): DBML, CITE, IOTBC, EDUPAN, NLPAI - 2023
pp. 261-269, 2023. DOI: 10.5121/ijci.2023.120219
ABBREVIATION DICTIONARY FOR TWITTER HATE SPEECH
Zainab Mansur¹, Nazlia Omar² and Sabrina Tiun³
¹ Department of Computer Science, Faculty of Sciences, Omar Al-Mukhtar University, Al Bayda, Libya
² Center for AI Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
³ Center for AI Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
ABSTRACT
Informal modes of communication, such as tweets, rely heavily on initialisms to reduce message size and typing time, which makes them difficult to mine and normalize with existing methods. This study therefore compiled a lexicon repository to normalize the initialisms used in English-language tweets. Several components were used to compile the repository, including the Tweepy Python library, a keyword list, a small set of developed rules, and online dictionaries. A lexicon repository of 300 abbreviations and their complete forms was compiled. It will be used in an ongoing study to normalize Twitter hate speech data for hate speech detection.
KEYWORDS
Normalization, Hate Speech, Twitter, Abbreviation, Dictionary
1. INTRODUCTION
Social media users communicate in informal language styles, making the data difficult for natural language processing (NLP) tools to process [1]. Furthermore, the abundant use of abbreviations in informal writing poses a challenge for text normalization [2]. Social media users, especially on Twitter, commonly abbreviate words in lieu of full names and words [3], [4]. Abbreviations are shortened versions of phrases or words, while an acronym is a word formed by combining parts of a phrase [5]. [6] used abbreviation dictionaries to normalize acronyms and abbreviations. Tweets that contain many slang or abbreviated words inflate the vector dimensionality produced by text mining algorithms [7]. Consequently, all abbreviations and subclasses of these terms must be expanded to their complete forms [5], [8]. Text normalization has been used to enhance multiple social media NLP tasks [9]. However, the currently available normalization methods fail to resolve and correct informal abbreviations in tweets [10]. Examples of such abbreviations are "RN" for "right now" and "OMG" for "oh, my God". Every language has its own set of distinct abbreviations, so social media platforms require dictionaries specific to each language [4]. This study constructed a lexicon repository to normalize initialisms on Twitter. The compiled dictionary will be used in an ongoing study to normalize data for Twitter hate speech detection. The rest of this paper is organized as follows: Section 2 reviews similar studies, Section 3 describes how the dictionary was compiled, and Section 4 presents the findings. A summary of the study and suggestions for future research are given in Section 5.
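To make the normalization step concrete, the following minimal Python sketch expands known initialisms in a tweet via dictionary lookup. The entries shown are hypothetical stand-ins for the compiled repository, not its actual contents.

    import re

    # Hypothetical excerpt of the compiled repository; the real resource
    # maps 300 initialisms to their complete forms.
    ABBREVIATIONS = {
        "rn": "right now",
        "omg": "oh, my God",
        "idk": "I do not know",
    }

    def normalize(tweet: str) -> str:
        """Replace every known initialism, matched case-insensitively on
        word boundaries, with its complete form."""
        def expand(match: re.Match) -> str:
            return ABBREVIATIONS.get(match.group(0).lower(), match.group(0))
        pattern = r"\b(" + "|".join(map(re.escape, ABBREVIATIONS)) + r")\b"
        return re.sub(pattern, expand, tweet, flags=re.IGNORECASE)

    print(normalize("OMG they are leaving RN"))
    # -> "oh, my God they are leaving right now"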
2. LITERATURE REVIEW
A rigorous normalization process is required to manage the wealth of social media data and text created online. Linguistic resources, such as abbreviation dictionaries, can facilitate this process. This section discusses studies with objectives similar to the present one, i.e., creating dictionary resources to normalize abbreviations in social media texts. [3] built a colloquial dictionary of the Malay language using a context-aware system and handled ambiguity, where a term may have many meanings, with a dictionary technique. The entries in the context dictionary were based on how frequently the terms before and after the mistaken word appeared. As the dictionary contained a limited number of entries and a simple schema, it was constructed using the Python dictionary module.
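As a hedged illustration of how such a context dictionary can be held in the plain Python dictionary module, consider the sketch below; the schema and the counts are invented for illustration and are not the actual structure used by [3].

    from collections import defaultdict

    # Toy context dictionary: an ambiguous short form maps to candidate
    # expansions, each scored by how often particular neighbouring words
    # appear before or after it (counts invented for illustration).
    context_dict = {
        "sat": {
            "saturday": {"on": 12, "this": 7},
            "satellite": {"weather": 9, "signal": 5},
        },
    }

    def disambiguate(abbrev, prev_word, next_word):
        """Pick the expansion whose recorded neighbours best match the
        words surrounding the ambiguous term."""
        scores = defaultdict(int)
        for expansion, neighbours in context_dict.get(abbrev, {}).items():
            scores[expansion] += neighbours.get(prev_word, 0)
            scores[expansion] += neighbours.get(next_word, 0)
        return max(scores, key=scores.get) if scores else abbrev

    print(disambiguate("sat", "on", "morning"))  # -> "saturday"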
[11] used the string-replace method to expand abbreviations by first converting an abbreviation to lowercase and then searching for it in four online dictionaries. However, more normalization dictionaries are needed that (1) provide an adequate vocabulary linked to a webpage, (2) include internet slang and abbreviations, and (3) are frequently updated. [4] searched free webpages on the Internet to extract sets of abbreviations. The meaning of each abbreviation was then manually checked against the available online dictionaries. According to the study, online English texts do not include social media terms. As such, slang term lexicons should be updated by manually compiling entries for each language.

[4] used a list of Spanish slang words carefully selected by [12] from 21,000 tweets with emotional hashtags. The Tweepy Python library was then used to develop a list of 200 slang words and their meanings. The words and their contexts were then manually revised.
Meanwhile, [13] created a context dictionary of Indonesian abbreviations and their multiple
context-based meanings. Only words that appeared in complaint sentences were compiled.
However, more research is needed to enhance the capacity of the algorithm to analyze English
words and to determine whether settings other than complaint data may be utilized. [14] developed
a method based on dictionary-based search and the longest common subsequence (LCS) for
Indonesian abbreviations and acronyms; a rough sketch of LCS matching is given after this
paragraph. However, the data collection process and the costs of maintaining the dictionary-based
algorithm were high. Furthermore, the LCS performed poorly because many words share similar
subsequences. [7] developed a dataset via crowdsourcing that uses the preceding or subsequent
word as the keyword to normalize abbreviated Indonesian words. However, specific keywords are
needed for reliable meaning identification.
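To make the LCS idea concrete, the sketch below scores each dictionary word against an abbreviation by their longest common subsequence and keeps the best match; the dictionary entries are hypothetical, and this is not [14]'s exact algorithm:

def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

dictionary_words = ["tidak", "dengan", "yang"]  # hypothetical Indonesian entries

def best_expansion(abbrev):
    # Normalize the score by word length so longer words are not favoured.
    return max(dictionary_words, key=lambda w: lcs_len(abbrev, w) / len(w))

print(best_expansion("tdk"))  # -> "tidak" ("tdk" is a common shortening)

The weakness noted above is visible even here: words that share many subsequence characters receive nearly identical scores, so the highest-scoring word is not always the intended expansion.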
Multiple studies have developed abbreviation dictionaries for social media texts in different
languages and for various tasks. Nevertheless, more dictionary resources are needed to
normalize abbreviations for hate speech detection in Twitter texts. Hate tweets have no
characteristics that clearly distinguish them from non-hate tweets [15]. Accordingly, it is
possible that some abbreviations are only used in hate speech and would not be included
in a list of non-hate speech abbreviations. Therefore, a separate list of hate speech
abbreviations might be required to detect hate speech accurately. As such, this present
study constructed a dictionary to help normalize the abbreviations in Twitter hate speech
datasets.
3. METHODOLOGY
Multiple components, such as the Tweepy Python library, a list of keywords, a set of
rules, and online dictionaries, were used to compile the resource. These components are
discussed in greater detail in the remainder of this section. Figure 1 depicts the
methodology that was used to compile the abbreviation dictionary.
Figure 1. The method used to compile the abbreviation dictionary
A Twitter registration form was completed to create a new profile and gain access to the
consumer key and secret (Figure 2). According to Twitter, each request must be authenticated
through Open Authorisation (OAuth) (Figure 3). Most tweets were retrieved using the open
Twitter search API; a minimal authentication sketch is given after Figure 3.
Figure 2. The Twitter API form in the developer account
Figure 3. The API access for authentication
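As a minimal sketch of the OAuth step, assuming Tweepy 3.x [16] and placeholder credentials taken from the developer forms in Figures 2 and 3:

import tweepy

# Placeholder credentials; the real values come from the Twitter developer form.
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)  # every request is now OAuth-signed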
The Tweepy Python library (http://www.tweepy.org/) was used to collect the tweets, excluding
tweets written in languages other than English [16]. Abbreviations that appeared in hateful
contexts were compiled using the exact keywords that [17] used. [17] provides 16,000 annotated
tweets that contain hate speech, and keywords similar to theirs were used to collect data from
Twitter, as seen in Table 1, because hate speech language on Twitter was the priority. However, a
word may be hateful to one person but benign to another. We therefore collected tweets using the
same hashtags as the benchmark dataset, which ensured that the context was the same for both
offensive and inoffensive tweets. Note that the annotation agreement of this dataset was 84%. A
total of 24,506 tweets were downloaded using the hashtags associated with those terms; a sketch of
this collection step is given after Figure 4. Figure 4 provides a sample of the tweets extracted
from Twitter.
Table 1. A list of hashtags used in the [17] dataset

Hashtag terms
“MKR”, “asian drive”, “femi-nazi”, “immigrant”, “nigger”, “sjw”,
“WomenAgainstFeminism”, “blameonenotall”, “islam terrorism”,
“notallmen”, “victimcard”, “victim card”, “arab terror”, “gamergate”, “jsil”,
“racecard”, “race card”

Figure 4. A sample of the downloaded tweets
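As a hedged sketch of this collection step, assuming Tweepy 3.x (in Tweepy 4.x, api.search becomes api.search_tweets) and an illustrative subset of the Table 1 hashtags:

import csv

import tweepy

# Authenticated client, as in the previous snippet (placeholder credentials).
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

hashtags = ["#MKR", "#notallmen", "#gamergate"]  # a small subset of Table 1

with open("tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["hashtag", "text"])
    for tag in hashtags:
        # lang="en" drops tweets written in languages other than English.
        for tweet in tweepy.Cursor(api.search, q=tag, lang="en").items(200):
            writer.writerow([tag, tweet.text])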
The extracted tweets were saved in the CSV file format. The abbreviations were then manually
selected at random and checked across the online dictionaries listed in Table 2.
Table 2. Online dictionaries used to confirm the interpretations of the abbreviations

Dictionary Name     URL
Urban Dictionary    https://www.urbandictionary.com
DICTIONARY.COM      https://www.dictionary.com
ABBREVIATIONS       https://www.abbreviations.com
NOSLANG.COM         https://www.noslang.com
TheFreeDictionary   https://www.thefreedictionary.com/
These dictionaries often contain multiple complete forms for a single abbreviation. Furthermore,
the meaning of an abbreviation can vary depending on its context [13]. As a result, a set of
guidelines was created to select the most acceptable complete form for a particular abbreviation,
given that lexical variants appear in contexts similar to those of their standard forms [18].
Multiple studies have developed rule-based methods for a wide range of languages and tasks that
yield impressive results [20, 22, 23]. As data analysis indicates that correlations can be detected
based on context [24], the data was analyzed and used to develop the following rules:
a. If an abbreviation appears more than twice in a similar context, it is included.
For example, the following two tweets: “Bataao BC ab victim card khel raha hai when
he is the one who called a fellow Muslim Islamist despite knowing the f…
https://t.co/pL52SOUEtQ” and “RT @tariq22qamar: Bataao BC ab victim card khel raha
hai when he is the one who called a fellow Muslim Islamist despite knowing the fact
th…”.
b. If an abbreviation lacks context, it is excluded.
As an illustration, the following two tweets: “@A_Feranmi @madebycharles
@DavidIAdeleke Backend o, everything. All while trying to shade NB for not being
“trustwo… https://t.co/25SMXEP7K5” and “@EazyH4 OMGGGG YOU A KING
FR FR 💯 #respect 🥺🥺🥺 #NotAllMen”.
c. If an abbreviation appears in a different context each time, it is excluded.
For example, these two different tweets: “@CM6982 @RagemanFC @Spadez86
@Nerdrotics Since when tf was BP political or SJW or whatever the fuck?” and
“@conservatyler But a lot of BP are astroturfed by elites who can shelter themselves
from consequences.”
It is worth noting that since we are interested in the context in which the abbreviations
appear, the pre-processing step performed in related works is optional here. A hypothetical
sketch of rule (a) as a programmatic filter is given below.
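The review itself was manual; purely as a hypothetical approximation, rule (a) could be written as the following filter, where “similar context” is reduced to two occurrences sharing at least one surrounding word:

def context_words(tweet, abbrev, window=3):
    # Collect up to `window` words on each side of the abbreviation.
    tokens = tweet.lower().split()
    if abbrev not in tokens:
        return set()
    i = tokens.index(abbrev)
    return set(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])

def passes_rule_a(abbrev, tweets):
    contexts = [c for c in (context_words(t, abbrev) for t in tweets) if c]
    # Keep occurrences whose context overlaps with at least one other occurrence.
    similar = [c for c in contexts
               if any(c & other for other in contexts if other is not c)]
    return len(similar) > 2

# Hypothetical usage: "vc" occurs three times with the shared neighbour "card".
sample = ["playing the vc card again", "that vc card excuse", "same vc card move"]
print(passes_rule_a("vc", sample))  # -> True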
4. RESULTS AND DISCUSSION
The abbreviations were randomly chosen from the extracted Twitter data and saved in a CSV file.
These words and their contexts were manually reviewed according to the developed rules. A list of
300 abbreviations and their complete forms was saved in a CSV file as the resulting abbreviation
dictionary. Figure 5 provides an excerpt of the complete forms of the abbreviations.
Figure 5. An excerpt of the initialism abbreviated words
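Purely as an illustrative sketch of how the compiled resource could be applied downstream, assuming a CSV file named abbreviation_dictionary.csv with abbreviation and full_form columns (the file name and schema are assumptions, not the published format):

import csv

# Load the 300-entry dictionary into a lookup table keyed by the abbreviation.
with open("abbreviation_dictionary.csv", encoding="utf-8") as f:
    dictionary = {row["abbreviation"].lower(): row["full_form"]
                  for row in csv.DictReader(f)}

def normalize(tweet):
    # Replace each token that matches a dictionary entry with its full form.
    return " ".join(dictionary.get(token.lower(), token) for token in tweet.split())

print(normalize("OMG that take is wild RN"))
# -> "oh my god that take is wild right now", given matching entries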
A few existing works, such as [4] and [10], have collected English abbreviations from several
web sources, which differ from our data source. The results of this research were therefore
compared with those of other comparable studies that built an abbreviation dictionary with data
from Twitter (Table 3). This present study extracted more tweet instances than extant studies
(Figure 6).
Table 3. A comparison with similar studies using Twitter data

Reference                          Abbreviation Count   Task                    Language
[3]                                Small in general     Text normalization      Malay
[19]                               378                  Text normalization      Indonesian
[12]                               200                  Sentiment analysis      Spanish
Proposed abbreviation dictionary   300                  Hate speech detection   English
Figure 6. A comparison of tweets extracted from similar extant studies
The created abbreviation list will be examined on how well it handles initialism abbreviations
when we develop the normalization model for hate speech text. Furthermore, entries for new
abbreviation terms will need to be created, so the list, which is available upon request, must be
regularly updated with abbreviations that it does not yet cover.
5. CONCLUSION
This present study constructed a Twitter lexicon repository that normalizes initialism
abbreviations. The dictionary was developed using multiple strategies, such as collecting
tweets with the Tweepy Python library, a list of hate speech keywords, a set of developed rules,
and online dictionary searches. The proposed method was then compared to similar methods that had
been used to build abbreviation dictionaries for Twitter. The suggested method is distinct from
earlier ones in that it addresses English hate speech. It also yielded more abbreviations than
similar methods, except for one method that extracted 78 more words. The
developed abbreviation dictionary will be used in a future study to normalize Twitter hate speech
detection datasets.
ACKNOWLEDGEMENTS
This work was supported by Universiti Kebangsaan Malaysia and the Ministry of Higher
Education (MOHE) under grants FRGS/1/2020/ICT02/UKM/02/6 and TAP-K007009.
REFERENCES
[1] T. Baldwin and Y. Li, “An in-depth analysis of the effect of text normalization in social media,” in
Proc. NAACL HLT 2015, pp. 420–429, 2015, doi: 10.3115/v1/n15-1045.
[2] D. L. Pennell and Y. Liu, “Normalization of informal text,” Comput. Speech Lang., vol. 28, no. 1, pp.
256–277, 2014, doi: 10.1016/j.csl.2013.07.001.
[3] M. A. Saloot, N. Idris, and R. Mahmud, “An architecture for Malay Tweet normalization,” Inf.
Process. Manag., vol. 50, no. 5, pp. 621–633, 2014, doi: 10.1016/j.ipm.2014.04.009.
[4] H. Gómez-Adorno, I. Markov, G. Sidorov, J. P. Posadas-Durán, M. A. Sanchez-Perez, and L.
Chanona-Hernandez, “Improving Feature Representation Based on a Neural Network for Author
Profiling in Social Media Texts,” Comput. Intell. Neurosci., vol. 2016, 2016, doi:
10.1155/2016/1638936.
[5] S. Noor, A. Noor, and S. Tiun, “Rule-based Text Normalization for Malay Social Media Texts,” Int.
J. Adv. Comput. Sci. Appl. (IJACSA), vol. 11, no. 10, 2020.
[6] Y. Wu, B. Tang, M. Jiang, S. Moon, and J. C. Denny, “Clinical Acronym/Abbreviation
Normalization using a Hybrid Approach,” 2013.
[7] D. Sebastian and K. A. Nugraha, “Text Normalization for Indonesian Abbreviated Word Using
Crowdsourcing Method,” in Proc. 2019 Int. Conf. Inf. Commun. Technol., pp. 529–532, 2019.
[8] A. M. Jaber, Word Sense Disambiguation for Clinical Abbreviations, Feb. 2022.
[9] M. Arora and V. Kansal, “Character level embedding with deep convolutional neural network for text
normalization of unstructured data for Twitter sentiment analysis,” Soc. Netw. Anal. Min., vol. 9, no.
1, 2019, doi: 10.1007/s13278-019-0557-y.
[10] J. Khan and S. Lee, “Enhancement of Text Analysis Using Context-Aware Normalization of Social
Media Informal Text,” Appl. Sci., 2021.
[11] P. Sosamphan, V. Liesaputra, S. Yongchareon, and M. Mohaghegh, “Evaluation of statistical text
normalisation techniques for Twitter,” in Proc. 8th Int. Jt. Conf. Knowl. Discov. Knowl. Eng. Knowl.
Manag. (IC3K 2016), vol. 1, pp. 413–418, 2016, doi: 10.5220/0006083004130418.
[12] G. Sidorov, M. I. Romero, I. Markov, R. Guzmán-Cabrera, L. Chanona-Hernández, and F. Velásquez,
“Detección automática de similitud entre programas del lenguaje de programación Karel basada en
técnicas de procesamiento de lenguaje natural [Automatic detection of similarity between programs in
the Karel programming language based on natural language processing techniques],” Comput. y Sist.,
vol. 20, no. 2, pp. 279–288, 2016, doi: 10.13053/CyS-20-2-2369.
[13] N. Hanafiah, A. Kevin, C. Sutanto, Fiona, Y. Arifin, and J. Hartanto, “Text Normalization Algorithm
on Twitter in Complaint Category,” Procedia Comput. Sci., vol. 116, pp. 20–26, 2017, doi:
10.1016/j.procs.2017.10.004.
[14] D. Gunawan, Z. Saniyah, and A. Hizriadi, “Normalization of Abbreviation and Acronym on
Microtext in Bahasa Indonesia by Using Dictionary-Based and Longest Common Subsequence
(LCS),” Procedia Comput. Sci., vol. 161, pp. 553–559, 2019, doi: 10.1016/j.procs.2019.11.155.
[15] Z. Zhang and L. Luo, “Hate speech detection: A solved problem? The challenging case of long tail on
Twitter,” Semant. Web, vol. 10, no. 5, pp. 925–945, 2018, doi: 10.3233/SW-180338.
[16] J. Roesslein, Tweepy Documentation, v3.6.0, 2016.
[17] Z. Waseem and D. Hovy, “Hateful Symbols or Hateful People? Predictive Features for Hate Speech
Detection on Twitter,” in Proc. NAACL Student Research Workshop, pp. 88–93, 2016, doi:
10.18653/v1/n16-2013.
[18] B. Han, “Improving the Utility of Social Media with Natural Language Processing,” Ph.D. thesis,
2014.
[19] N. Hanafiah, A. Kevin, C. Sutanto, and Y. Arifin, “Text Normalization Algorithm on Twitter in
Complaint Category,” Procedia Comput. Sci., vol. 116, pp. 20–26, 2017, doi:
10.1016/j.procs.2017.10.004.
[20] U. Nadia and N. Omar, “Malay Named Entity Recognition using Rule Based Approach,” Asia-Pacific
Journal of Information Technology and Multimedia, vol. 8, no. 1, pp. 37–47, 2019.
[21] H. Alshalabi, S. Tiun, N. Omar, E. A. Anaam, and Y. Saif, “BPR algorithm: New broken plural rules
for an Arabic stemmer,” Egyptian Informatics Journal, vol. 23, no. 3, pp. 363–371, 2022.
[22] N. A. Halid and N. Omar, “Malay Part of Speech Tagging Using Rule-Based Approach,” Jurnal
Teknologi Maklumat dan Multimedia Asia-Pasifik, vol. 6, no. 2, pp. 91–107, 2017.
[23] S. Saad and K. Latiff U., “Extraction of concept and concept relation for Islamic term using syntactic
pattern approach,” Asia-Pacific Journal of Information Technology and Multimedia, vol. 7, no. 2, pp.
71–84, 2018.
AUTHORS
ZAINAB MANSUR received the B.Sc. degree (Hons.) in computer science from Omar Al-Mukhtar
University, Libya, in 2003, and the M.S. degree in computer science from the Universiti Kebangsaan
Malaysia (UKM), Malaysia, in 2011. She is currently a researcher at UKM. Her research interest falls
under natural language processing, machine learning, text and Web mining, and sentiment analysis.
NAZLIA OMAR received the B.Sc. degree (Hons.) from UMIST, U.K, the M.Sc. degree
from the University of Liverpool, U.K., and the Ph.D. degree from the University of
Ulster, U.K. She is currently an Associate Professor at the Center for AI Technology,
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia. Her
main research interest is in the area of natural language processing and computational
linguistics.
SABRINA TIUN received the B.Sc. degree (Hons.) from Bradley University, USA, and the
master’s and Ph.D. degrees from Universiti Sains Malaysia. She is currently a Senior
Lecturer with the Faculty of Information Science and Technology (FTSM), Universiti
Kebangsaan Malaysia (UKM), Malaysia. Her research interests include natural language
processing, computational linguistics, speech processing, and social science computing.