This study investigates the feasibility of identifying gender from personal names written in the Devanagari
script. While extensive research exists for languages such as English, Russian, and Brazilian Portuguese,
little attention has been given to low-resource languages like Nepali. This study addresses that gap, aiming
to improve gender classification for Nepali names. We compare traditional machine learning algorithms
with contextualized transformer architectures, namely variants of BERT (Bidirectional Encoder
Representations from Transformers) fine-tuned to detect the gender of individuals with Nepali names.
Our experiments explore how effectively these techniques capture linguistic nuances unique to the Nepali
language and assess their overall performance in terms of accuracy, precision, recall, and F1 score.
Both experiments reveal superior performance by the BERT variants DeBERTa and DistilBERT,
demonstrating the merits of advanced NLP techniques for gender classification in Nepali. The study also
shows that additionally encoding name endings (the last character of a name) improves the performance
of the traditional ML algorithms by 10%. Through this work, we contribute gender classification
methodologies tailored to the linguistic nuances of the Nepali language.
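The name-ending encoding described above can be sketched as a feature extractor for a traditional
ML classifier. The feature names and the extra prefix/suffix features here are illustrative
assumptions, not the study's actual feature set:

```python
def name_features(name: str) -> dict:
    """Sparse features for a Devanagari personal name. The last-character
    feature is the 'name ending' encoding the study reports as a +10%
    boost for traditional ML classifiers (feature naming is my own)."""
    return {
        f"prefix={name[:2]}": 1,
        f"suffix={name[-2:]}": 1,
        f"last_char={name[-1]}": 1,  # name-ending (last character) encoding
    }
```

A dict like this can be fed to any standard vectorizer before training a classical model such
as logistic regression or an SVM.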
UNDERSTANDING PEOPLE TITLE PROPERTIES TO IMPROVE INFORMATION EXTRACTION PROCESS (cscpconf)
In this paper, we introduce a new approach to extracting information about people mentioned in
Arabic text. When a person's name appears in Arabic text, it is usually combined with a title;
in this paper the focus is on the properties of those titles. We have identified six properties
for each title, covering gender, type, class, status, format, and entity existence. We have
studied each property, identified all attributes and values that belong to it, and classified
them accordingly. Sometimes a person's title is attached to an entity; we have also identified
properties for these entities and show how they work in harmony with the person-title properties.
We use graphs for the implementation: nodes represent person titles, person names, entities, and
their properties, while edges represent properties inherited from parent nodes by child nodes.
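The parent-to-child property inheritance over graph edges can be illustrated with a minimal
node class. This is a sketch of the idea, assuming a simple single-parent lookup, not the
paper's actual implementation:

```python
class Node:
    """Graph node for a person title, person name, or entity; an edge to
    `parent` lets a child node inherit the parent's properties."""

    def __init__(self, name, props=None, parent=None):
        self.name = name
        self.props = props or {}
        self.parent = parent

    def get(self, key):
        # Look up the property locally, then walk the parent edges upward.
        node = self
        while node is not None:
            if key in node.props:
                return node.props[key]
            node = node.parent
        return None
```

A person-name node attached to a title node then answers queries for title properties it does
not define itself, while its own properties take precedence.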
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE (kevig)
This paper contributes the first evaluation of neural network models on the NER task for the
Myanmar language. The experimental results show that these neural sequence models can produce
promising results compared to the baseline CRF model. Among the neural architectures, a
bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also
aims to discover the effectiveness of neural network approaches to Myanmar text processing
and to promote further research on this understudied language.
SYLLABLE-BASED NEURAL NAMED ENTITY RECOGNITION FOR MYANMAR LANGUAGE (ijnlc)
Named Entity Recognition (NER) for the Myanmar language is essential to Myanmar natural language processing research. In this work, NER for Myanmar is treated as a sequence tagging problem, and the effectiveness of deep neural networks on Myanmar NER has been investigated. Experiments apply deep neural network architectures to syllable-level Myanmar text. The first manually annotated NER corpus for the Myanmar language is also constructed and proposed. In developing this in-house NER corpus, sentences from online news websites as well as sentences from the ALT-Parallel-Corpus are used. The ALT corpus is part of the Asian Language Treebank (ALT) project under ASEAN IVO. This paper contributes the first evaluation of neural network models on the NER task for the Myanmar language. The experimental results show that these neural sequence models can produce promising results compared to the baseline CRF model. Among the neural architectures, a bidirectional LSTM network with a CRF layer on top gives the highest F-score. This work also aims to discover the effectiveness of neural network approaches to Myanmar text processing and to promote further research on this understudied language.
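The CRF layer on top of the BiLSTM chooses the best whole tag sequence rather than the best
tag per position; its decoding step is Viterbi search over emission and transition scores.
A minimal sketch of that decoding (toy scores, not a trained model):

```python
def viterbi(emissions, transitions, tags):
    """Best tag path under per-position emission scores plus tag-to-tag
    transition scores -- the decoding a CRF layer performs over BiLSTM
    outputs. emissions: list of {tag: score} dicts, one per position."""
    prev = {t: (emissions[0][t], [t]) for t in tags}
    for em in emissions[1:]:
        cur = {}
        for t in tags:
            # Best predecessor for tag t at this position.
            p_best = max(tags, key=lambda p: prev[p][0] + transitions[(p, t)])
            score = prev[p_best][0] + transitions[(p_best, t)] + em[t]
            cur[t] = (score, prev[p_best][1] + [t])
        prev = cur
    return max(prev.values(), key=lambda s: s[0])[1]
```

In a real BiLSTM-CRF the emission scores come from the network and the transition matrix is
learned; here both are supplied directly for illustration.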
Named Entity Recognition for Telugu Using Conditional Random Field (Waqas Tariq)
Named Entity (NE) recognition is a task in which proper nouns and numerical information are extracted from documents and classified into predefined categories such as person names, organization names, location names, and miscellaneous (dates and others). It is a key technology for Information Extraction, Question Answering systems, Machine Translation, Information Retrieval, etc. This paper reports on the development of an NER system for Telugu using Conditional Random Fields (CRF). Though this state-of-the-art machine learning technique has been widely applied to NER in several well-studied languages, its use for the Telugu language is very new. The system makes use of the contextual information of the words along with a variety of features that are helpful in predicting the four different named entity (NE) classes: person name, location name, organization name, and miscellaneous (dates and others). Keywords: named entity, Conditional Random Field, NE, CRF, NER, named entity recognition
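The "contextual information of the words" a CRF tagger consumes is typically a window of
hand-crafted features per token. A sketch of such a feature function (this particular feature
set is an illustrative assumption, not the paper's):

```python
def crf_features(tokens, i):
    """Window features for token i, the kind of contextual input a CRF
    sequence tagger uses to predict NE classes."""
    return {
        "word": tokens[i],
        "is_digit": tokens[i].isdigit(),  # numeric info -> dates, etc.
        "prev_word": tokens[i - 1] if i > 0 else "<BOS>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<EOS>",
    }
```

Each token's dict would be vectorized and fed to the CRF alongside the gold label sequence
during training.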
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION (ijnlc)
Information Extraction (IE) is a sub-discipline of Artificial Intelligence. IE identifies
information in unstructured sources that adheres to predefined semantics, e.g. people,
locations, etc. Recognition of named entities (NEs) from computer-readable natural language
text is a significant task in IE and natural language processing (NLP). Named entity (NE)
extraction is an important step in processing unstructured content. Unstructured data is
computationally opaque, and computers require computationally transparent data for processing;
IE adds meaning to raw data so that it can be easily processed by computers. Various approaches
have been applied to extract entities from text. This paper elaborates the need for NE
recognition for Marathi and discusses the issues and challenges involved in NE recognition
tasks for the Marathi language. It also explores methods and techniques that are useful for
creating the learning resources and lexicons that are important for extracting NEs from
unstructured natural language text.
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES (kevig)
Distributed language representation has become the most widely used technique for language representation in natural language processing tasks. Most NLP models based on deep learning use pre-trained distributed word representations, commonly called word embeddings. Determining the highest-quality word embeddings is of crucial importance for such models. However, selecting appropriate word embeddings is a perplexing task, since the projected embedding space is not intuitive to humans. In this paper, we explore different approaches to creating distributed word representations. We perform an intrinsic evaluation of several state-of-the-art word embedding methods, analysing their performance at capturing word similarities against existing benchmark datasets of word-pair similarities. The paper conducts a correlation analysis between ground-truth word similarities and the similarities obtained by the different word embedding methods.
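The intrinsic evaluation described above typically computes cosine similarity between
embedding pairs and correlates it with human judgements via a rank correlation. A minimal
sketch (plain Spearman with no tie handling, which is an assumption; benchmark datasets
usually need tie-aware ranking):

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm

def spearman(xs, ys):
    """Spearman rank correlation between model similarities and
    ground-truth similarity judgements (no tie handling)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

In practice one would collect `cosine(...)` scores for every word pair in a benchmark like
WordSim-style datasets and report `spearman` against the human scores.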
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (kevig)
Text analysis has been attracting increasing attention in this data era. Selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed, but it remains unclear which is the most effective in practice. This article evaluates and compares the available feature selection methods for general versatility on authorship attribution problems and tries to identify the most effective method. The general versatility of feature selection methods and its connection to selecting appropriate features for varying data are discussed. In addition, different languages, different types of features, different systems for calculating the accuracy of an SVM (support vector machine), and different criteria for ranking feature selection methods are used together to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes. Chi-square proved to be the best method overall.
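The chi-square criterion singled out above scores a feature by how far its co-occurrence
with a class departs from independence. For a 2x2 feature/class contingency table the
statistic has a closed form, sketched here:

```python
def chi_square(a, b, c, d):
    """Chi-square score of a feature against a class from a 2x2 table:
    a = feature present & class, b = feature present & other classes,
    c = feature absent & class,  d = feature absent & other classes."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den if den else 0.0
```

Features are ranked by this score and the top-k kept, which is the dimensionality reduction
the abstract describes.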
Cross-lingual similarity discrimination with translation characteristics (ijaia)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment.
This paper proposes a discriminative model, trained on a bilingual corpus, that divides a set of
sentences in the target language into two classes according to their similarity to a given
sentence in the source language. Positive outputs of the discriminative model are then ranked by
similarity probability, and the translation candidates of the given sentence are selected from
the top-n positive results. One of the problems in model building is the extremely imbalanced
training data: positive samples are the translations of the target sentences, while negative
samples, the non-translations, are numerous or unknown. We train models on four kinds of
sampling sets with the same translation characteristics and compare their performance.
Experiments on an open dataset of 1500 English-Chinese sentence pairs are evaluated with three
metrics, with satisfying performance well above the baseline system.
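One common way to build a training set from such skewed data is to undersample the negatives.
This sketch shows one such sampling scheme; the paper compares four, whose exact construction
is not specified here:

```python
import random

def undersample(positives, negatives, ratio=1, seed=0):
    """Keep all positive (translation) pairs and sample the abundant
    negatives down to ratio * len(positives), a simple remedy for an
    extremely imbalanced training set."""
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    return positives + rng.sample(negatives, k)
```

Varying `ratio` (and the negative pool itself) yields the family of sampling sets whose
trained models can then be compared on held-out sentence pairs.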
Myanmar named entity corpus and its use in syllable-based neural named entity... (IJECEIAES)
Myanmar is a low-resource language, and this is one of the main reasons Myanmar natural language processing has lagged behind other languages. Currently, there is no publicly available named entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for Myanmar was developed and proposed to support the evaluation of named entity extraction. At present, the corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar named entity recognition. Experimental results from 10-fold cross-validation revealed that syllable-based neural sequence models without additional feature engineering can give better results than the baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to text processing for the Myanmar language and to promote future research on this understudied language.
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts (kevig)
Cyberbullying is currently one of the most important research fields. Most researchers have contributed to bully-text identification in English texts or comments; due to the scarcity of data, stemming Tamil text is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and creating a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, we propose the rule-based iterative preprocessing algorithm (RBIPA). In this work, Tamil morphemes and lemmas were extracted using the suffix-stripping technique, and a supervised machine learning algorithm was used to classify words as pronouns and proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and lemmatization that discovers the exact words in Tamil-language comments. RBIPA shows 84.96% accuracy on a test dataset with a total of 13,000 words.
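The iterative suffix-stripping idea can be sketched generically: apply the longest matching
suffix rule, then repeat on the shortened stem until no rule fires. The toy Latin-script
suffix table below is a placeholder; a real Tamil stemmer needs a far richer, linguistically
validated suffix inventory and morphophonemic repair rules:

```python
# Hypothetical toy rule table -- stand-in for RBIPA's Tamil suffix rules.
SUFFIXES = ["ness", "ing", "s"]

def iterative_stem(word, suffixes=SUFFIXES, min_len=3):
    """Strip suffixes repeatedly until no rule applies, keeping the stem
    at least min_len characters (the iterative step RBIPA is built on)."""
    changed = True
    while changed:
        changed = False
        for suf in suffixes:
            if word.endswith(suf) and len(word) - len(suf) >= min_len:
                word = word[: -len(suf)]
                changed = True
                break  # restart rule scan on the shortened stem
    return word
```

The iteration matters for agglutinative morphology: "meetings" sheds "s" and then "ing" in
successive passes, which a single-pass stripper would miss.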
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen... (cscpconf)
Sentiment Analysis (SA) and machine learning techniques collaborate to understand the attitude
of a text's writer as implied in a particular text. Although SA is challenging in itself, it is
especially challenging for the Arabic language. In this paper, we enhance sentiment analysis
for Arabic. Our approach begins with special pre-processing steps. We then adopt the sentiment
keywords co-occurrence measure (SKCM) as an algorithm for sentiment-based feature selection.
This feature selection method is applied to three sentiment corpora using an SVM classifier.
We compare our approach with the traditional methods followed by most SA work. The experimental
results are very promising for enhancing SA accuracy.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Source- and target-word segmentation and alignment is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for dealing with out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked whether it would hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but we found instead that even though our proposed approach improved Top-1 accuracy, the individual-character-unit alignment model
somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (the method applies equally well to most written Indic languages). Training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top 10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into predefined semantic categories. NER often benefits from added background knowledge in the form of gazetteers. However, such a collection does not deal with name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. From the identified named entities, word and context features are used to define a pattern. The pattern of each named entity category is used as a seed pattern to identify named entities in the test set. Pattern scoring and tuple-value scores enable the generation of new patterns to identify the named entity categories. We evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named entity corpora, and for Tamil with documents from the FIRE corpus, and obtained an average F-measure of 75% for both languages.
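The seed-pattern step can be sketched as extracting the context around a labelled entity and
then matching that context in unlabelled text. This minimal version assumes a one-word context
window and no scoring, both simplifications of the paper's method:

```python
def seed_pattern(tokens, i):
    """Context pattern around a labelled entity token (window of 1)."""
    left = tokens[i - 1] if i > 0 else "<BOS>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
    return (left, right)

def apply_pattern(tokens, pattern):
    """Candidate entities wherever the same left/right context recurs
    in new text -- the bootstrapping expansion step."""
    out = []
    for i, tok in enumerate(tokens):
        left = tokens[i - 1] if i > 0 else "<BOS>"
        right = tokens[i + 1] if i + 1 < len(tokens) else "<EOS>"
        if (left, right) == pattern:
            out.append(tok)
    return out
```

In the full approach, candidates found this way are scored, the reliable ones are added to the
seed set, and new patterns are generated from them in further iterations.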
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES (ijistjournal)
Named Entity Recognition is a core task in Natural Language Processing. It is a subtask of information extraction that identifies and classifies proper nouns into predefined categories such as person, location, organization, time, and date. This document focuses on NER approaches and discusses the work done so far on identifying named entities in various languages. The authors have carried out a comparative study of named entity recognition and found that the CRF approach has proved best for identifying named entities in Indian languages.
Named Entity Recognition System for Hindi Language: A Hybrid Approach (Waqas Tariq)
Named Entity Recognition (NER) is a major early step in Natural Language Processing (NLP) tasks like machine translation, text-to-speech synthesis, and natural language understanding. It seeks to classify words that represent names in text into predefined categories like location, person name, organization, date, and time. In this paper we use a combination of machine learning and rule-based approaches to classify named entities. The paper introduces a hybrid approach for NER: we experiment with statistical approaches, Conditional Random Fields (CRF) and Maximum Entropy (MaxEnt), and a rule-based approach built on a set of linguistic rules. The linguistic approach plays a vital role in overcoming the limitations of statistical models for a morphologically rich language like Hindi. The system also uses a voting method to improve the performance of the NER system. Keywords: NER, MaxEnt, CRF, rule base, voting, hybrid approach
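The voting step that combines the three systems can be sketched as a per-token majority vote
over their tag sequences. The exact combination rule used in the paper is not specified here,
so plain majority voting is an assumption:

```python
from collections import Counter

def vote(system_outputs):
    """Per-token majority vote over the tag sequences produced by the
    individual systems (e.g. CRF, MaxEnt, rule-based)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*system_outputs)]
```

With an odd number of systems and no ties, this always yields a single winning tag per token;
real hybrid systems often weight votes by per-system confidence instead.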
Chunker-Based Sentiment Analysis and Tense Classification for Nepali Text (kevig)
The article presents Sentiment Analysis (SA) and tense classification using the skip-gram model for word-to-vector encoding of the Nepali language. The experiment on SA for positive-negative classification is carried out in two ways. In the first experiment, the vector representation of each sentence is generated using the skip-gram model followed by Multi-Layer Perceptron (MLP) classification, and an F1 score of 0.6486 is achieved for positive-negative classification with an overall accuracy of 68%. In the second experiment, verb chunks are extracted using a Nepali parser and a similar experiment is carried out on the verb chunks; an F1 score of 0.6779 is observed for positive-negative classification with an overall accuracy of 85%. Hence, chunker-based sentiment analysis proves better than sentence-based sentiment analysis. This paper also proposes using a skip-gram model to identify the tenses of Nepali sentences and verbs. In the third experiment, the vector representation of each verb chunk is generated using the skip-gram model followed by MLP classification, and the verb chunks give a very low overall accuracy of 53%. The fourth experiment, tense classification using full sentences, results in improved performance with an overall accuracy of 89%; past tenses are identified and classified more accurately than other tenses. Hence, sentence-based tense classification proves better than verb-chunk-based tense classification.
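Turning skip-gram word vectors into a fixed-size input for the MLP requires pooling them; a
common choice is simple averaging. The abstract does not spell out the pooling step, so the
averaging below is an assumption:

```python
def sentence_vector(words, embeddings, dim):
    """Average the skip-gram word vectors of a sentence (or verb chunk)
    into one fixed-size vector for the MLP classifier; out-of-vocabulary
    words are skipped, and an empty match yields the zero vector."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

The same pooling serves both setups in the paper: over all words of a sentence, or over only
the tokens of an extracted verb chunk.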
A Real-time Classroom Attendance System Utilizing Viola–Jones for Face Detect... (Nischal Lal Shrestha)
Abstract
The face of a human is crucial for conveying identity. Computer scientists, neuroscientists, and psychologists all exploit this feature using image-processing techniques for commercial and law-enforcement applications. Likewise, this feature can be brought into classrooms to maintain records of students' attendance. The contemporary, traditional way of recording attendance involves human intervention and requires the cooperation of the students, which is hectic and wastes class time. An automated real-time classroom attendance system detects students in a still image or video frame coming from a digital camera and marks their attendance by recognizing them. The system utilizes the Viola–Jones object detection framework, which is capable of processing images extremely rapidly with high detection rates. In the next stage, the detected face in the image is recognized using the Local Binary Patterns Histogram.
Keywords: computer vision; face detection; face recognition; feature extraction; image processing; Local Binary Patterns Histogram; object detection; Viola–Jones object detection.
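The recognition stage described above builds on the Local Binary Patterns operator, which LBPH then histograms over image regions. A minimal sketch of the basic 3x3 LBP code follows; the pixel values and neighbour ordering are illustrative, and production systems typically use OpenCV's implementation:

```python
def lbp_code(patch):
    """Compute the 8-bit Local Binary Pattern code for a 3x3 patch.
    Each neighbour is compared with the centre pixel; neighbours that
    are >= the centre contribute a 1-bit, read clockwise from top-left."""
    center = patch[1][1]
    # the 8 neighbours in clockwise order starting at top-left
    neighbours = [patch[0][0], patch[0][1], patch[0][2],
                  patch[1][2], patch[2][2], patch[2][1],
                  patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbours):
        if value >= center:
            code |= 1 << bit
    return code

# Example 3x3 grayscale patch (made-up intensities)
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
print(lbp_code(patch))  # prints 241
```

An LBPH descriptor is then the concatenated histogram of such codes over a grid of face regions, compared between the probe face and enrolled faces.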
Mathematical modeling is the art of translating problems from an application area into tractable mathematical formulations whose theoretical and numerical analysis provides insight, answers, and guidance useful for the originating application.
Mathematical Modeling
• is indispensable in many applications.
• is successful in many further applications.
• gives precision and direction for problem solution.
• enables a thorough understanding of the system modeled.
More Related Content
Similar to Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari Script
Cross lingual similarity discrimination with translation characteristics (ijaia)
In cross-lingual plagiarism detection, the similarity between sentences is the basis of judgment. This paper proposes a discriminative model, trained on a bilingual corpus, that divides a set of sentences in the target language into two classes according to their similarity to a given sentence in the source language. Positive outputs of the discriminative model are then ranked by similarity probability, and the translation candidates of the given sentence are finally selected from the top-n positive results. One problem in model building is the extremely imbalanced training data, in which positive samples are the translations of the target sentences, while negative samples (non-translations) are numerous or unknown. We train models on four kinds of sampling sets with the same translation characteristics and compare their performance. Experiments on an open dataset of 1,500 English-Chinese sentence pairs, evaluated with three metrics, show satisfying performance, much higher than the baseline system.
Myanmar named entity corpus and its use in syllable-based neural named entity... (IJECEIAES)
Myanmar is a low-resource language, which is one of the main reasons why Myanmar Natural Language Processing has lagged behind other languages. Currently, there is no publicly available named-entity corpus for the Myanmar language. As part of this work, the first manually annotated named-entity-tagged corpus for Myanmar was developed and proposed to support the evaluation of named-entity extraction. At present, our named-entity corpus contains approximately 170,000 named entities and 60,000 sentences. This work also contributes the first evaluation of various deep neural network architectures on Myanmar Named Entity Recognition. Experimental results of 10-fold cross-validation reveal that syllable-based neural sequence models without additional feature engineering can give better results than a baseline CRF model. This work also aims to discover the effectiveness of neural network approaches to textual processing for Myanmar and to promote future research on this understudied language.
A COMPARATIVE STUDY OF FEATURE SELECTION METHODS (kevig)
Text analysis has been attracting increasing attention in this data era, and selecting effective features from datasets is a particularly important part of text classification studies. Feature selection excludes irrelevant features from the classification task, reduces the dimensionality of a dataset, and improves the accuracy and performance of identification. Many feature selection methods have been proposed so far, yet it remains unclear which is the most effective in practice. This article evaluates and compares the available feature selection methods for general versatility on authorship attribution problems and tries to identify the most effective method. The general versatility of feature selection methods and its connection to selecting appropriate features for varying data are discussed. In addition, different languages, different types of features, different systems for calculating the accuracy of a support vector machine (SVM), and different criteria for ranking feature selection methods were used together to measure the general versatility of these methods. The analysis results indicate that the best feature selection method differs for each dataset; however, some methods can always extract useful information to discriminate the classes, and chi-square proved to be the better method overall.
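The chi-square criterion highlighted above can be computed for a single feature-class pair from a 2x2 contingency table of document counts. A minimal sketch, where the counts in the example are made-up illustrative numbers:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square score of a feature for one class, from a 2x2
    contingency table of document counts:
      n11: docs in the class that contain the feature
      n10: docs outside the class that contain the feature
      n01: docs in the class without the feature
      n00: docs outside the class without the feature
    """
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# Illustrative counts: a feature that appears mostly inside the class
print(round(chi_square(20, 5, 10, 65), 2))  # prints 39.68
```

Features are then ranked by this score per class, and the top-k are kept for the classifier.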
RBIPA: An Algorithm for Iterative Stemming of Tamil Language Texts (kevig)
Cyberbullying is currently one of the most important research fields. Most researchers have contributed to bully-text identification in English texts or comments; due to the scarcity of data, stemming Tamil text is frequently a tedious job. Tamil is a morphologically diverse and agglutinative language, and the creation of a Tamil stemmer is not an easy undertaking. After examining the major difficulties encountered, we propose the rule-based iterative preprocessing algorithm (RBIPA). In this attempt, Tamil morphemes and lemmas were extracted using the suffix stripping technique, together with a supervised machine learning algorithm to classify words as pronouns or proper nouns. The novelty of the proposed system is a preprocessing algorithm for iterative stemming and lemmatization that discovers the exact words in Tamil-language comments. RBIPA shows 84.96% accuracy on the given test dataset, which has a total of 13,000 words.
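The iterative suffix-stripping idea described above can be sketched as follows. The suffix list here is a tiny Latin-transliterated toy set for illustration only, not the actual RBIPA rule base for Tamil:

```python
# Illustrative, transliterated suffixes; the real RBIPA rule set is
# far larger and operates on Tamil script.
SUFFIXES = ["kal", "ai", "il", "in"]

def iterative_stem(word, suffixes=SUFFIXES, min_stem=3):
    """Repeatedly strip the longest matching suffix until no rule
    applies or the remaining stem would become too short."""
    changed = True
    while changed:
        changed = False
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf) and len(word) - len(suf) >= min_stem:
                word = word[: -len(suf)]
                changed = True
                break
    return word

print(iterative_stem("veedukalil"))  # strips "il" then "kal" -> "veedu"
```

The iteration matters for an agglutinative language: several case and plural markers can stack on one stem, so a single stripping pass is not enough.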
Enhancing the Performance of Sentiment Analysis Supervised Learning Using Sen... (cscpconf)
Sentiment Analysis (SA) and machine learning techniques collaborate to understand the attitude of a text's writer implied in a particular text. Although SA is challenging in itself, it is especially challenging for the Arabic language. In this paper, we enhance sentiment analysis for Arabic. Our approach begins with special pre-processing steps; we then adopt the sentiment keywords co-occurrence measure (SKCM) as a sentiment-based feature selection method. This feature selection method was applied to three sentiment corpora using an SVM classifier. We compare our approach with some traditional methods followed by most SA works. The experimental results are very promising for enhancing SA accuracy.
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION... (cscpconf)
Source and target word segmentation and alignment is a primary step in the statistical learning of a transliteration. Here, we analyze the benefit of a syllable-like segmentation approach for learning a transliteration from English to an Indic language, which aligns the training-set word pairs in terms of sub-syllable-like units instead of individual character units. While this has been found useful for handling out-of-vocabulary words in English-Chinese in the presence of multiple target dialects, we asked whether it would also hold for Indic languages, which are simpler in their phonetic representation and pronunciation. We expected the syllable-like method to perform marginally better, but found instead that even though our proposed approach improved the Top-1 accuracy, the individual-character-unit alignment model somewhat outperformed our approach when the Top-10 results of the system were re-ranked using language-modeling approaches. Our experiments were conducted for English-to-Telugu transliteration (our method applies equally well to most written Indic languages): training consisted of a syllable-like segmentation and alignment of a large training set, on which we built a statistical model by modifying a previous character-level maximum-entropy-based transliteration learning system due to Kumaran and Kellner; testing consisted of applying the same segmentation to a test English word, applying the model, and re-ranking the resulting top-10 Telugu words. We also report the dataset creation and selection, since standard datasets are not available.
SEMI-SUPERVISED BOOTSTRAPPING APPROACH FOR NAMED ENTITY RECOGNITION (kevig)
The aim of Named Entity Recognition (NER) is to identify references to named entities in unstructured documents and to classify them into predefined semantic categories. NER often benefits from added background knowledge in the form of gazetteers; however, such a collection does not handle name variants and cannot resolve the ambiguities involved in identifying entities in context and associating them with predefined categories. We present a semi-supervised NER approach that starts by identifying named entities with a small set of training data. Using the identified named entities, word and context features are used to define a pattern, and the pattern of each named-entity category is used as a seed pattern to identify the named entities in the test set. Pattern scoring and tuple-value scoring enable the generation of new patterns to identify the named-entity categories. We evaluated the proposed system for English with tagged (IEER) and untagged (CoNLL 2003) named-entity corpora, and for Tamil with documents from the FIRE corpus, yielding an average F-measure of 75% for both languages.
STUDY OF NAMED ENTITY RECOGNITION FOR INDIAN LANGUAGES (ijistjournal)
Named Entity Recognition is an early task in Natural Language Processing and a subtask of information extraction: it identifies and classifies proper nouns into predefined categories such as person, location, organization, time, and date. This document focuses on NER approaches and discusses the work done so far on identifying named entities in various languages. The authors carried out a comparative study of named-entity recognition and found that the CRF approach has proven best for identifying named entities in Indian languages.
Approximating Value of pi(Π) using Monte Carlo Iterative Method (Nischal Lal Shrestha)
1 Introduction
Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. Their essential idea is using randomness to solve problems that might be deterministic in principle. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other approaches. Monte Carlo methods are mainly used in three problem classes: optimization, numerical integration, and generating draws from a probability distribution.
Monte Carlo methods vary, but tend to follow a particular pattern:
• Define a domain of possible inputs
• Generate inputs randomly from a probability distribution over the domain
• Perform a deterministic computation on the inputs
• Aggregate the results
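The four steps above map directly onto the classic estimation of π: the domain is the unit square, the random inputs are points in it, the deterministic computation tests whether a point falls inside the quarter circle, and aggregation takes the inside fraction. A minimal sketch:

```python
import random

def estimate_pi(samples, seed=0):
    """Estimate pi by sampling points uniformly in the unit square and
    counting the fraction that falls inside the quarter circle of
    radius 1; that fraction approaches pi/4."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()   # random input from the domain
        if x * x + y * y <= 1.0:            # deterministic computation
            inside += 1
    return 4 * inside / samples             # aggregate the results

print(estimate_pi(100_000))
```

The error shrinks roughly as 1/sqrt(samples), which is why Monte Carlo shines when higher-accuracy deterministic methods are unavailable, not when they are cheap.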
A minor project proposal on
A Real-time Classic Chess Game
Submitted in partial fulfillment of the requirements for the degree
of Bachelor of Engineering in Software Engineering
under Pokhara University.
Submitted by:
Ashish Tiwari(https://aashishtiwari.com.np), 161709
Nischal Lal Shrestha(https://nischal.info.np), 161722
Poshan Pandey, 161724
Date:
August 06, 2018
Abstract
The purpose of Election Portal is to digitize election and election-related activities
in Nepal. Election Portal, therefore, offers a web application which stores candidate
information, election constituency details, latest election results, and various other
information directly related to election and enables the clients to fetch those data
efficiently. The problem statement relies on understanding the proper way to store
all those datasets in the database, solving the complexity involved within and finding
an appropriate interface to provide all these datasets to the visitors. Election Portal
also provides an opportunity to individual volunteers who are authorized to update
election result dataset of a particular constituency in real-time. Each web page of the
Election Portal is equipped with either a search form or a filter form to enhance users' ability to get information very quickly.
Keywords— Datasets, Django, Election, Portal, Web Application
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari Script
Comparative Analysis of ML and DL Models for Gender Prediction in Devanagari Script
Preprint, compiled January 5, 2024
Nischal Lal Shrestha
Abstract
This study investigates the feasibility of identifying gender from personal names written in the Devanagari
script. While extensive research exists for languages such as English, Russian, and Brazilian Portuguese,
little attention has been given to low-resource languages like Nepali. This study addresses this gap,
aiming to enhance gender classification for Nepali names. We compare traditional machine learning
algorithms with advanced contextualized transformer architectures, namely variants of BERT (Bidirectional
Encoder Representations from Transformers) fine-tuned to detect the gender of individuals with
Nepali names. Our experiments explore how effectively these techniques capture linguistic
nuances unique to the Nepali language and assess their overall performance in terms of accuracy,
precision, recall, and F1 score. Both experiments reveal superior performance by the BERT
variants DeBERTa and DistilBERT, demonstrating the merits of advanced NLP techniques for gender
classification in Nepali. The study also shows that additionally encoding name endings (the last
character of a name) improves the performance of traditional ML algorithms by around 10%. Through this
work, we contribute gender classification methodologies tailored to the linguistic nuances
of the Nepali language.
1 Introduction
Gender detection has been a captivating research area
within AI. Detecting gender is a multifaceted process, often
involving innovative approaches using Natural Language
Processing (NLP) and Computer Vision (CV). Accurate
gender identification in domains such as newspapers, social
media, and articles provides valuable insights into existing
biases, representation, and diversity [1, 2]. Gender classification
is a complex task that requires careful consideration
of cultural, social and individual factors. Accurate gen-
der detection helps create targeted advertisements, gives
users personalized content, makes customer support faster,
and assists market researchers in understanding audience
segments. Inclusive and unbiased AI services require con-
tinuous improvements in reliable gender detection.
Much research has been conducted on gender detection
using NLP [3, 4, 5] and CV [6, 7]. There are also commercial
applications that can predict gender from a
name [8, 9, 10]. These commercial applications primarily
rely on dictionaries to predict gender and are not freely
accessible to the community. Moreover, they carry
little information for low-resource languages.
Although there have been many approaches to detecting
gender from names in languages such as English, Russian,
and Brazilian Portuguese [3, 4, 5], little or no research has
been done for low-resource languages like Nepali with names
written in the Devanagari script. The unique linguistic
characteristics and cultural nuances of low-resource languages
like Nepali remain largely unexplored in the context of gender
prediction. In this paper, we apply a range of machine
learning and deep learning algorithms to 502,468 Nepali
names written in the Devanagari script for gender detection.
For the gender classification task, we trained traditional ML
algorithms, including Random Forest, K-Neighbors Classifier,
Decision Trees, Naive Bayes, Logistic Regression,
LDA, QDA, SVM, and Gradient Boosting, as well as BERT-based
classifiers such as DistilBERT and DeBERTa, on the dataset, split
into an 80:20 train-test ratio.
2 Related Works
Computer Vision applications mainly use facial features
[11, 12, 13] to detect the gender of a person. The use
of Convolutional Neural Networks (CNNs) and other ad-
vanced algorithms allows for robust gender classification
based on facial attributes and patterns [14].
Identifying the gender of a person based on their name
is a common NLP task and can be accomplished using
techniques such as gender pronoun usage, language models
combined with machine learning, and gender-specific names.
In [15] Goswami et al. conducted a gender detection exper-
iment on 9,660 gender-labeled blog posts from blogger.com.
Their model, incorporating features such as slang word
statistics and sentence length, achieved an impressive ac-
curacy of 89.3%.
Vashisth and Meehan [16] delved into gender inference using
NLP techniques, including bag of words, word embedding,
logistic regression, SVM, and Naive Bayes, using Twitter
data.
In [4], the authors combined three types of features
(word endings, character n-grams, and a dictionary of
names) within a linear supervised model to detect gender
by full name for the Russian language. The paper shows
that the proposed strategy is highly successful, achieving
an accuracy of up to 96%.
Similarly, [5] implemented several ML algorithms to detect
the gender of a name for Brazilian Portuguese. According to
[5], some models accurately predict gender in more than
95% of cases. The paper also shows that recurrent
models outperform feedforward models in the binary
classification of gender.
In [3], Yifan Hu et al. proposed a high-accuracy character-level
algorithm to detect the gender of a person from their name.
The paper also shows that using the last name in addition to
the first name improves the performance of the model.
3 Dataset and Preprocessing
3.1 Nepali Names
Nepali names, unlike English names, present distinctive
challenges for machine learning algorithms due to their
unique linguistic and cultural attributes. For example,
when transliterating the English name "Albert Einstein"
into Nepali, it becomes एल्बर्ट आइन्स्टाइन. The Devanagari
script introduces different characters, illustrating the com-
plexities that machine learning algorithms must navigate
when handling Nepali names.
Furthermore, Nepali names often exhibit gender-specific
associations based on certain characters at the end. For
instance, names ending with "ा" (aa), "ी" (ii), "◌ू" (uu)
or "या" (yaa) are commonly linked to feminine gender.
Examples include "सीता" (Sita), "मन्जु" (Manju) and "रजनी"
(Rajni).
The gender association can be emphasized through the
presence or absence of specific characters. For instance,
"Ishani" (ईशानी) is feminine, while "Ishan" (ईशान) is mascu-
line, showcasing how the addition or removal of characters
conveys gender-specific information in Nepali names.
3.2 Dataset
The experimental dataset comprises a total of 502,468
Nepali names, with 235,870 marked as male and 266,598
as female. The dataset was acquired through web scraping
from the official website of Election Commission, Nepal[17].
The preprocessing of the scraped data involved several
steps to ensure its quality and reliability. First, faulty
names were systematically eliminated from the dataset.
Entries containing gibberish elements such as hyphens
(-), periods (.), or random numbers were identified and
removed. Names with obvious, easily observable mistakes
were excluded from the dataset. Apart from these removals,
the original dataset was left unaltered.
To address the possibility of a single name being associated
with both genders, a frequency-based approach was imple-
mented to assign labels. In instances where a name had
entries for both male and female genders, the gender label
was determined based on the higher count. This approach
aimed to enhance the accuracy and consistency of gender
labeling within the dataset, acknowledging the potential
ambiguity of certain names across genders.
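The frequency-based labelling described above can be sketched in a few lines. The helper name is hypothetical, and the tie-breaking behaviour (equal counts) is not specified in the paper, so the sketch simply defaults to female in that case:

```python
# Hypothetical sketch of the frequency-based labelling: when a name has
# entries under both genders, keep the gender with the higher count
# (0 = female, 1 = male, matching the convention in Table 1).
# Tie-breaking is unspecified in the paper; here ties default to 0.

def assign_gender(female_count, male_count):
    """Resolve an ambiguous name by majority count."""
    return 1 if male_count > female_count else 0

# Rows mirroring the style of Table 1.
rows = [("बेलसपुरा", 26, 0), ("बागुर", 0, 3)]
labels = [assign_gender(f, m) for _, f, m in rows]
print(labels)  # [0, 1]
```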
Table 1 presents a sample data of the overall dataset. The
table includes the first names in Nepali script along with
the corresponding counts of females and males, as well as
the gender classification.
Table 1: Sample Data
First Name      #Female  #Male  Gender
रितेन्द्रकुमार       0        1      1
नित्रकला         1        0      0
बागुर            0        3      1
बेलसपुरा         26       0      0
पुलहोवा          0        1      1
Figure 1 visually represents the gender distribution
within the dataset, where '0' corresponds to female names
and '1' to male names. Notably, the bar graph
reveals that the number of occurrences of female names
(266,598) exceeds that of male names (235,870).
Figure 1: Gender Distribution of Names
Figure 2 depicts a histogram illustrating the frequency
distribution of names by length. The histogram exhibits
a right-skewed distribution: a significant proportion of
names falls within the length range of 3 to 11 characters,
and very few names are longer than 14 characters.
Figure 2: Distribution of Name Lengths
3.3 Tokenization
Tokenization in Natural Language Processing (NLP) is the
process of breaking a text down into smaller units known
as tokens. Typically, texts are broken down into words.
For this paper, however, names are tokenized into individual
characters, as a name usually consists of a single word.
After tokenization, the CountVectorizer is employed for
feature extraction. The CountVectorizer from scikit-learn
is utilized to convert the tokenized names into a numerical
format suitable for machine learning models. This trans-
formation allows us to represent each name as a numerical
vector, capturing the relationships between individual char-
acters and providing a foundation for gender classification.
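The character tokenization and count-vectorization step can be sketched in plain Python; scikit-learn's `CountVectorizer(analyzer='char')` performs the equivalent transformation. The helper below is illustrative, not the study's code:

```python
from collections import Counter

# Minimal sketch of character-level tokenization followed by count
# vectorization: each name becomes a vector of per-character counts
# over the vocabulary of all characters seen in the corpus.

def char_count_vectors(names):
    tokenized = [list(name) for name in names]            # character tokens
    vocab = sorted({ch for toks in tokenized for ch in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts.get(ch, 0) for ch in vocab])
    return vocab, vectors

vocab, vectors = char_count_vectors(["सीता", "ईशान"])
print(len(vocab), vectors)
```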
3.4 Class Weights
As the dataset is imbalanced, we utilize class weights when
fine-tuning the pretrained models. Class weights adjust
the loss computation to penalize false predictions for the
minority class more heavily than for the majority class.
The weight for the majority class (female names) is 0.4691 and
the weight for the minority class (male names) is 0.5309.
This configuration encourages the model to balance learning
across the two classes, ultimately benefiting final performance
on the challenging imbalanced dataset.
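The paper does not state the exact weighting formula. One common scheme, weighting each class by the opposite class's share of the data, approximately reproduces the reported values (it yields 0.4694/0.5306 versus the reported 0.4691/0.5309), so the sketch below should be read as a plausible reconstruction rather than the study's method:

```python
# Sketch of a common class-weighting scheme: each class is weighted by
# the opposite class's share of the data, so the minority class (male)
# receives the larger weight. This is an assumption -- the paper does
# not state its formula -- but it roughly matches the reported weights.

n_female, n_male = 266_598, 235_870
total = n_female + n_male

w_female = n_male / total    # weight for the majority class
w_male = n_female / total    # weight for the minority class

print(round(w_female, 4), round(w_male, 4))  # ≈ 0.4694 0.5306
```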
4 Models for Gender Detection
Traditional ML algorithms (Random Forest, K-Neighbors
Classifier, Decision Trees, Naive Bayes, Logistic Regression,
LDA, QDA, SVM, and Gradient Boosting) along with
BERT-based classifiers (DistilBERT and DeBERTa) were
trained and evaluated for the task. Two experimental
setups, differing only in the input given to the models,
were trained separately and their results compared.
• Experiment 1: Only the first name was given as
input to the model.
• Experiment 2: The first name along with an additional
encoding of the name ending was given as input to the
model. For instance, if the first name is एलिशा,
then the name ending "ा" (aa) is also appended to
the first name, so the input becomes एलिशा + "ा".
Figure 3 illustrates the comprehensive flow of Experiment
1 involving traditional ML algorithms. The process begins
with tokenizing the input at the character level, followed
by vectorization. Subsequently, the model is trained based
on the chosen algorithm, leading to the classification of the
input name.
Figure 3: Experiment 1 Flow Diagram
Figure 4 illustrates the flow of Experiment 2. It is similar
to Experiment 1 except that an additional encoding is added
for the name ending. The input is tokenized at the
character level, followed by vectorization. The result is then
forwarded to the chosen ML model, which classifies the
input name.
Figure 4: Experiment 2 Flow Diagram
4.1 Decision Tree
Decision tree, a widely used supervised learning method
for tackling classification and regression problems, oper-
ates as a structured classifier. In this framework, internal
nodes correspond to features of a dataset, branches repre-
sent the decision-making process, and each leaf node indi-
cates the classification result. This approach involves split-
ting datasets into tree-like structures to facilitate decision-
making [18]. For our problem, we employed a Decision Tree
as a classifier. Our decision tree model, designed to pre-
dict the gender of a given name, utilizes the Gini impurity
function. The entire training set serves as the root, and
the character-level vector is recursively distributed based
on gender.
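The Gini impurity that scores candidate splits is G = 1 − Σ_c p_c², where p_c is the fraction of samples in class c at a node. A minimal illustration over gender labels:

```python
from collections import Counter

# Gini impurity of a node: 1 - sum of squared class proportions.
# A pure node scores 0; a maximally mixed binary node scores 0.5.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini([0, 0, 0, 0]))  # 0.0  (pure node, all female)
print(gini([0, 1, 0, 1]))  # 0.5  (evenly mixed node)
```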
4.2 Random Forest
Random forests are a group of methods that involve creat-
ing an ensemble (or forest) of decision trees. These trees are
developed using a randomized version of the tree induction
algorithm. We implemented a Random Forest classifier
with the parameter n_estimators set to 100, specifying the
number of trees in the forest. Furthermore, we employed
the Gini impurity function to evaluate the quality of splits
within each decision tree.
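The Experiment 1 setup with a Random Forest can be sketched end to end with scikit-learn: character-level count features feeding 100 Gini-based trees, matching the parameters stated above. The toy names and labels below are illustrative only, not drawn from the study's dataset:

```python
# Minimal sketch of the Experiment 1 pipeline: character-level
# CountVectorizer features into a 100-tree Random Forest with Gini
# impurity, as described in the paper. Toy data for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

names = ["सीता", "मन्जु", "रजनी", "ईशान", "राम", "हरि"]
genders = [0, 0, 0, 1, 1, 1]  # 0 = female, 1 = male (toy labels)

model = make_pipeline(
    CountVectorizer(analyzer="char"),          # character tokens
    RandomForestClassifier(n_estimators=100,   # as in the paper
                           criterion="gini",
                           random_state=0),
)
model.fit(names, genders)
print(model.predict(["सीता"]))
```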
4.3 LDA and QDA
Linear Discriminant Analysis (LDA) and Quadratic Dis-
criminant Analysis (QDA) are two fundamental classifica-
tion methods in statistical and probabilistic learning [19].
LDA assumes that features are normally distributed and
have a common covariance matrix for each class. It seeks
a linear combination of features to maximize the separa-
tion between class means while minimizing within-class
spread. On the other hand, QDA relaxes the assumption
of a common covariance matrix, allowing each class to have
its covariance structure.
4.4 Naive Bayes
The Naive Bayes Classifier relies on the Bayesian theo-
rem and is well-suited for scenarios with high-dimensional
inputs. Despite its simplicity, Naive Bayes has demon-
strated effectiveness, often outperforming more intricate
classification methods [20].
4.5 K Neighbors Classifier
It involves categorizing new examples based on the pre-
dominant category assignment of their closest counterparts
within a reference dataset. We used K-nearest neighbors
for binary classification. The classifier had 5 neighbors and
used a simple weight function.
4.6 Support Vector Machine
Support Vector Machine (SVM) stands as a widely adopted
Supervised Learning algorithm applied to both classifica-
tion and regression tasks, with a primary focus on Classifi-
cation problems in Machine Learning. The objective of the
SVM algorithm is to identify the optimal line or decision
boundary, effectively classifying n-dimensional space to
assign new data points to their respective categories. The
concept of a hyperplane is central to achieving the best
decision [18].
4.7 Logistic Regression
Logistic regression is a classification technique used to
model the relationship between the dependent and
independent variables. In this study, the logistic regression
classifier was implemented with a stopping-criterion
tolerance of 0.0001 and an L2-norm penalty.
4.8 Gradient Boosting
Gradient Boosting Classifier is a machine learning algo-
rithm used for classification tasks. It is a part of the
ensemble learning methods and is particularly powerful
for building robust and accurate predictive models. The
algorithm builds an ensemble of weak learners, typically
decision trees, and combines them to create a strong pre-
dictive model.
4.9 DeBERTa and DistilBERT
4.9.1 DeBERTa
DeBERTa (Decoding-enhanced BERT with Disentangled
Attention) is a Transformer-based neural language model
that aims to improve on the BERT and RoBERTa models. It
introduces two novel techniques: disentangled attention,
which enhances the model's ability to capture intricate
relationships between tokens by reducing attention entanglement,
and an enhanced mask decoder, which refines the
decoding mechanism for improved language understanding.
These mechanisms collectively contribute to DeBERTa's
enhanced performance on various natural language processing
tasks [21].
4.9.2 DistilBERT
DistilBERT is a compressed version of BERT (Bidirectional
Encoder Representations from Transformers), designed for
improved computational efficiency while maintaining con-
siderable performance. Knowledge distillation is performed
during the pre-training phase to reduce the size of a larger
BERT model by 40%. Developed by Hugging Face, Distil-
BERT uses knowledge distillation, learning from the larger
BERT model as its teacher. With a reduced number of
layers and parameters, DistilBERT achieves a streamlined
architecture, making it suitable for resource-constrained
environments [22, 23].
5 Performance Metrics
To evaluate the performance of the trained model we are
using 4 performance metrics namely Accuracy, Recall, Pre-
cision and F1 Score.
5.1 Accuracy
Accuracy is a number between 0 and 1 that measures the
overall correctness of a model. It is calculated as the ratio
of correctly predicted instances (true positives plus true
negatives) to the total number of samples.
Accuracy = (TP + TN) / Total Predictions × 100%
5.2 Recall
Recall, also known as sensitivity, is the percentage of
positive samples that are correctly labelled.
Recall = TP / (TP + FN) × 100%
5.3 Precision
Precision, also known as positive predictive value, is the
percentage of samples labelled "positive" that are
actually positive.
Precision = TP / (TP + FP) × 100%
5.4 F1 Score
The F1 score is the harmonic mean of precision and recall,
combining both into a single value.
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
In the above formulas,
• True Positives (TP) represent the instances that
are correctly labeled as positives by the model.
• False Negatives (FN) represent the instances
that are incorrectly labeled as negatives by the
model.
• True Negatives (TN) represent the instances
that are correctly labeled as negatives by the
model.
• Lastly, False Positives (FP) represent the in-
stances that are incorrectly labeled as positive by
the model.
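The four metrics above follow directly from the confusion counts; the toy values below are illustrative:

```python
# Compute accuracy, recall, precision, and F1 from the confusion
# counts TP, FP, TN, FN, exactly as defined above.

def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, recall, precision, f1

# Toy confusion counts for illustration.
acc, rec, prec, f1 = metrics(tp=80, fp=20, tn=70, fn=30)
print(round(acc, 2), round(rec, 2), round(prec, 2), round(f1, 2))
```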
6 Experimental Results
As we treated gender classification as a binary classification
task, the input to each model was the first name
and the output was the gender. All models were trained on
Google Colab. PyCaret was used for the traditional ML
models, and PyTorch for training DeBERTa and DistilBERT.
For the ML models, the default parameters provided
by the PyCaret implementation were used, except that
5-fold cross-validation was applied. For the deep
learning models, the following parameters were used:
• Train Size = 64, Valid Size = 32
• Epochs = 5, Learning Rate = 0.00001
• Loss = BCELoss, Optimizer = Adam
• Max length of input = 15
• weight decay=0.01
6.1 Results for Experiment 1
Table 2 displays the results obtained from Experiment 1,
in which solely the given name served as input. Among
the BERT variants examined, DeBERTa and DistilBERT
surpassed conventional machine learning algorithms in per-
formance. Notably, DeBERTa achieved the greatest accu-
racy, reaching 0.8720, whereas DistilBERT trailed closely
behind. Among classical techniques, Random Forest
led with an accuracy of 0.7682, while
Quadratic Discriminant Analysis (QDA) and Naive Bayes
delivered results barely better than random selection.
Table 2: Performance comparison of the algorithms on
Experiment 1
Model Name Accuracy Recall Precision F1
DeBERTa 0.8720 0.8710 0.8719 0.8714
DistilBERT 0.8699 0.8695 0.8694 0.8695
Random Forest 0.7682 0.7414 0.7588 0.7500
K Neighbors 0.7380 0.6860 0.7371 0.7107
Gradient Boosting 0.7276 0.6967 0.7153 0.7058
Decision Tree 0.7231 0.6754 0.7176 0.6958
Logistic Regression 0.7126 0.6830 0.6978 0.6903
SVM 0.7126 0.6961 0.6935 0.6942
LDA 0.7123 0.6843 0.6969 0.6906
Naive Bayes 0.5368 0.6200 0.5628 0.4784
QDA 0.5277 0.4981 0.6025 0.3943
6.2 Results for Experiment 2
Table 3 displays the results obtained from Experiment 2,
in which the first name along with an additional encoding
of the name ending was provided as input. The additional
encoding yields better performance for the traditional ML
algorithms: an improvement of around 10% in accuracy is
seen for them in Experiment 2. As in Experiment 1,
Random Forest performed best among the traditional ML
group, with an accuracy of 0.8525. For the BERT variants,
we did not see an improvement over Experiment 1 when the
additional encoding was used, with accuracy reaching 0.8705
for DeBERTa and 0.8714 for DistilBERT. This shows that
the additional encoding of name endings did not help the
performance of the BERT variants.
Table 3: Performance comparison of the algorithms on
Experiment 2
Model Name Accuracy Recall Precision F1
DistilBERT 0.8714 0.8702 0.8715 0.8707
DeBERTa 0.8705 0.8701 0.8701 0.8701
Random Forest 0.8525 0.8516 0.8369 0.8442
K Neighbors 0.8437 0.8193 0.8431 0.8310
Logistic Regression 0.8283 0.8405 0.8027 0.8211
Gradient Boosting 0.8283 0.8596 0.7921 0.8245
LDA 0.8263 0.8526 0.7928 0.8216
SVM 0.8250 0.8589 0.7873 0.8216
Decision Tree 0.8112 0.7916 0.8030 0.7973
Naive Bayes 0.6724 0.3726 0.8404 0.5100
QDA 0.6156 0.3988 0.7898 0.4487
7 Limitation and Future Work
While conducting the study, several opportunities for ex-
tending and deepening the exploration emerged. However,
certain limitations were identified, which can be investi-
gated further:
• Like name endings, the middle name plays an important
role in gender classification for Nepali names. For
example, names with middle names like कुमार, राज,
लाल, प्रसाद, बहादुर, etc. are considered masculine,
whereas names with middle names like देवी, कुमारी,
माया, etc. are considered feminine. Further investigation
of the importance of the middle name could improve
the performance of gender classification models.
• N-grams of the name endings could also be beneficial
alongside the endings themselves. Since our study
considered only the last character and not bigrams or
trigrams of name endings, we see potential for
improvement by considering n-grams too.
• A dictionary-based approach can be considered
for common Nepali names for fast inference and
accurate results.
8 Conclusion
In conclusion, this study addressed a significant gap in
gender classification research for the low-resource language
Nepali, specifically focusing on names written in the
Devanagari script. The research compared the performance
of traditional machine learning algorithms and advanced
contextualized transformer architectures, specifically
fine-tuned variants of BERT, for gender classification of
Nepali names. Utilizing an imbalanced dataset
with class weights assigned to handle the disparity between
the majority and minority classes, the study discovered
that DeBERTa outperformed other models in the task,
followed closely by DistilBERT. Among traditional algo-
rithms, Random Forest proved to be the most successful.
Although these findings offer meaningful contributions to
gender classification in the Nepali linguistic landscape, fur-
ther research remains necessary to explore the role of mid-
dle names, n-grams of name endings, and dictionary-based
approaches for fast and efficient inference. Continuous
advancements in reliable gender detection will undoubtedly
promote more inclusive and unbiased AI services, ensuring
fairness and equality in diverse applications.
References
[1] The Guardian. How we analysed 70m comments on
the guardian website. https://www.theguardian.
com/technology/2016/apr/12/how-we-analysed-
70m-comments-guardian-website, Year.
[2] Toptal. Is open source open to women.
https://www.toptal.com/open-source/is-open-
source-open-to-women, Year.
[3] Yifan Hu, Changwei Hu, Thanh Tran, Tejaswi Kasturi,
Elizabeth Joseph, and Matt Gillingham. What’s in a
name? – gender classification of names with character
based machine learning models, 2021.
[4] Alexander Panchenko and Andrey Teterin. Detecting
gender by full name: Experiments with the russian
language. volume 436, pages 169–182, 04 2014. ISBN
978-3-319-12579-4. doi: 10.1007/978-3-319-12580-0_
17.
[5] Rosana C. B. Rego, Verônica M. L. Silva, and Victor M.
Fernandes. Predicting gender by first name using
character-level machine learning, 2021.
[6] Vikas Sheoran, Shreyansh Joshi, and Tanisha R.
Bhayani. Age and Gender Prediction Using Deep
CNNs and Transfer Learning, page 293304. Springer
Singapore, 2021. ISBN 9789811610929. doi: 10.1007/
978-981-16-1092-9_25. URL http://dx.doi.org/10.
1007/978-981-16-1092-9_25.
[7] Majid Farzaneh. Arcface knows the gender, too!, 2021.
[8] Genderapi. https://genderapi.io/, .
[9] Genderize. https://genderize.io/, .
[10] Gender-api. https://gender-api.com/, .
[11] Ke Zhang, Ce Gao, Liru Guo, Miao Sun, Xingfang
Yuan, Tony X. Han, Zhenbing Zhao, and Baogang
Li. Age group and gender estimation in the wild with
deep ror architecture, 2017.
[12] Anand Venugopal, Yadukrishnan V, and Remya
Nair T. A svm based gender classification from
children facial images using local binary and non-
binary descriptors. pages 631–634, 03 2020. doi:
10.1109/ICCMC48092.2020.ICCMC-000117.
[13] Olarik Surinta and Thananchai Khamket. Gender
recognition from facial images using local gradient
feature descriptors. pages 1–6, 10 2019. doi: 10.1109/
iSAI-NLP48611.2019.9045689.
[14] Ahmad B. Hassanat, Abeer Albustanji, Ahmad S.
Tarawneh, Malek Alrashidi, Hani Alharbi, Mohammed
Alanazi, Mansoor Alghamdi, Ibrahim S Alkhazi, and
V. B. Surya Prasath. Deep learning for identifica-
tion and face, gender, expression recognition under
constraints, 2021.
[15] Sumit Goswami, Sudeshna Sarkar, and Mayur Rustagi.
Stylometric analysis of bloggers’ age and gender. 01
2009.
[16] Pradeep Vashisth and Kevin Meehan. Gender classi-
fication using twitter text data. pages 1–6, 06 2020.
doi: 10.1109/ISSC49989.2020.9180161.
[17] Voter list database - election commission of
nepal. https://election.gov.np/np/page/voter-
list-db.
[18] D. Kamelesun, R. Saranya, and P. Kathiravan. A
benchmark study by using various machine learning
models for predicting covid-19 trends, 2023.
[19] Benyamin Ghojogh and Mark Crowley. Linear and
quadratic discriminant analysis: Tutorial, 2019.
[20] Vikramkumar, Vijaykumar B, and Trilochan. Bayes
and naive bayes classifier, 2014.
[21] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and
Weizhu Chen. Deberta: Decoding-enhanced bert with
disentangled attention, 2021.
[22] Victor Sanh, Lysandre Debut, Julien Chaumond, and
Thomas Wolf. Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter, 2020.
[23] Hugging Face. Distilbert documentation, 2023. URL
https://huggingface.co/docs/transformers/
model_doc/distilbert.