Discrete clusters formulation through the exploitation of optimized k-modes algorithm for hypotheses validation in social work research: the case of Greek social workers working with refugees
This article presents the results of self-funded quantitative research conducted among social workers working in the "refugee" crisis and social services in Greece (1). The research argues, among other findings, that front-line professionals possess specific characteristics regarding their working profile. The statistical analysis relied on significance tests to validate the initial hypotheses concerning correlations between dataset variables. In contrast, this work presents an alternative approach for validating initial hypotheses through clustering algorithms. Toward that goal, we evaluated several frequently used clustering algorithms for their efficiency in feature selection, and we propose a modified k-Modes algorithm for efficient feature subset selection.
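To illustrate the kind of clustering the abstract describes, the sketch below implements a basic k-Modes pass in plain Python. The working-profile attributes are invented stand-ins, and the paper's specific modifications to k-Modes are not reproduced here; this is only the textbook algorithm (modes instead of means, simple matching dissimilarity instead of Euclidean distance).

```python
import random
from collections import Counter

def hamming(a, b):
    # Dissimilarity between two categorical records: count of mismatched attributes.
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, iters=20, seed=0):
    """Basic k-Modes: like k-Means, but centroids are per-attribute modes."""
    rng = random.Random(seed)
    modes = rng.sample(records, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for rec in records:
            idx = min(range(k), key=lambda i: hamming(rec, modes[i]))
            clusters[idx].append(rec)
        new_modes = []
        for i, cl in enumerate(clusters):
            if not cl:
                new_modes.append(modes[i])
                continue
            # New mode: the most frequent category of each attribute in the cluster.
            new_modes.append(tuple(Counter(col).most_common(1)[0][0]
                                   for col in zip(*cl)))
        if new_modes == modes:
            break
        modes = new_modes
    return modes, clusters

# Toy survey-style records (hypothetical categorical working-profile attributes).
data = [("yes", "ngo", "low"), ("yes", "ngo", "low"), ("no", "state", "high"),
        ("no", "state", "high"), ("yes", "ngo", "mid"), ("no", "state", "mid")]
modes, clusters = k_modes(data, k=2)
```

Comparing which variables dominate each resulting mode is one way a cluster structure can support or undermine an initial hypothesis about professional profiles.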
PARTICIPATION ANTICIPATING IN ELECTIONS USING DATA MINING METHODS (IJCI Journal)
Anticipating people's political behavior can considerably help election candidates assess their chances of success and understand the public's motivations for selecting them. In this paper, we provide a general schematic of the architecture of a participation-prediction system for presidential elections using KNN, a classification tree, and Naive Bayes, with the Orange toolkit following the CRISP-DM process, which produced promising output. To test and assess the proposed model, we conducted a case study of 100 qualified persons who participated in the 11th presidential election of the Islamic Republic of Iran, predicting their participation in Kohgiluyeh and Boyer-Ahmad. We show that KNN performs the prediction and classification tasks with higher accuracy than the other two algorithms.
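The nearest-neighbour vote at the heart of KNN can be sketched in a few lines of plain Python. The voter-profile features and labels below are invented (the survey data are not public), and real systems would use a library such as scikit-learn; this only shows the mechanism.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among the k nearest labelled points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature voter profiles, labelled 1 = participates, 0 = abstains.
train = [((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
         ((3.0, 3.2), 1), ((3.1, 2.9), 1), ((2.8, 3.0), 1)]
print(knn_predict(train, (3.0, 3.0)))   # → 1
```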
In this case study we identify the factors that influence the adoption of a new system in a major company in Saudi Arabia. We develop a theoretical framework to help derive a better understanding of system adoption via socio-technical integration.
We formulated 14 hypotheses that were tested via a survey of 42 system users. Management support and change management were found to be significant factors influencing system adoption, and the corresponding null hypotheses were rejected at the 5% significance level (p < 0.05). Discussion and recommendations for future research are provided.
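Significance testing of the kind described can be illustrated with a correlation t-test in plain Python. The Likert-style responses below are invented (the survey data are not available), and the critical value used is the approximate two-sided 5% cutoff for the toy sample's degrees of freedom, not the study's.

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def t_statistic(x, y):
    """t = r*sqrt(n-2)/sqrt(1-r^2); reject H0 (no linear association) at the 5%
    level when |t| exceeds the two-sided critical value for df = n - 2
    (about 2.145 for this 16-point toy sample)."""
    r = pearson(x, y)
    n = len(x)
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Invented 5-point Likert responses: perceived management support vs. adoption.
support  = [5, 4, 5, 3, 2, 4, 5, 1, 2, 3, 4, 5, 2, 1, 3, 4]
adoption = [5, 5, 4, 3, 2, 4, 4, 2, 2, 3, 5, 5, 1, 2, 3, 4]
t = t_statistic(support, adoption)
significant = abs(t) > 2.145
```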
Artificial intelligence involves two basic ideas: studying the thought processes of human beings, and representing those processes in machines such as computers and robots. AI technologies and techniques have useful applications in every domain of mental health care, including clinical decision-making, treatment, assessment, self-care, and care management. This application is an AI-based fuzzy expert system that gives students a basic insight into possible career opportunities, enabling them to move toward the path most suitable for them. The system acts as a personal aid to students, taking into consideration each student's interests and aptitude-test results, and the fuzzy expert system recommends suitable careers accordingly.
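A minimal sketch of the fuzzy-membership idea behind such an expert system, in plain Python. The careers, score ranges, and triangular breakpoints are invented for illustration; a real system would use many inputs and a full rule base with fuzzification and defuzzification.

```python
def tri(x, a, b, c):
    """Triangular membership function: 0 outside (a, c), 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical rule base mapping an aptitude score (0-100) to a degree of fit
# per career; careers and breakpoints are illustrative assumptions only.
def career_fit(aptitude):
    return {
        "arts":        tri(aptitude, -1, 0, 50),
        "management":  tri(aptitude, 30, 55, 80),
        "engineering": tri(aptitude, 60, 85, 101),
    }

fits = career_fit(75)
best = max(fits, key=fits.get)   # career with the highest membership degree
```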
Identifying Structures in Social Conversations in NSCLC Patients through the ... (IJERA)
The exploration of social conversations to address patients' needs is an important analytical task, and many scholarly publications are contributing to fill the knowledge gap in this area. The main difficulty remains the inability to turn such contributions into pragmatic processes the pharmaceutical industry can leverage to generate insight from social media data, one of the most challenging sources of information available today due to its sheer volume and noise. This study builds on the work of Scott Spangler and Jeffrey Kreulen and applies it to identify structure in social media through the extraction of a topical taxonomy able to capture the latent knowledge in social conversations on health-related sites. A mechanism for automatically identifying and generating a taxonomy from social conversations is developed and pressure-tested using public data from media sites focused on the needs of cancer patients and their families. Moreover, a novel method for generating category labels and determining an optimal number of categories is presented, extending Spangler and Kreulen's research in a meaningful way. We assume the reader is familiar with taxonomies, what they are, and how they are used.
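One simple way to realize a category-labelling step like the one the abstract mentions is to label each category with its most over-represented term relative to the whole corpus. The sketch below is a plain-Python stand-in, not Spangler and Kreulen's method, and the conversations are invented.

```python
from collections import Counter

def label_categories(clusters):
    """Label each category by the term most over-represented in it, i.e. the
    word maximizing (frequency within category) / (frequency in whole corpus)."""
    corpus = Counter(w for docs in clusters.values() for d in docs for w in d.split())
    labels = {}
    for name, docs in clusters.items():
        local = Counter(w for d in docs for w in d.split())
        labels[name] = max(local, key=lambda w: local[w] / corpus[w])
    return labels

# Invented snippets of patient conversations, pre-grouped into two categories.
convs = {
    "c1": ["side effects of chemo", "chemo fatigue side effects"],
    "c2": ["insurance coverage question", "coverage denied insurance"],
}
labels = label_categories(convs)
```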
CLUSTERING ALGORITHM FOR A HEALTHCARE DATASET USING SILHOUETTE SCORE VALUE (IJCSIT)
The huge amount of healthcare data, coupled with the need for data analysis tools, has made data mining an interesting research area. Data mining tools and techniques help discover and understand hidden patterns in a dataset that may not be visible through visualization alone. Selecting an appropriate clustering method and an optimal number of clusters for healthcare data can be confusing and difficult. A large number of clustering algorithms are available for clustering healthcare data, but it is very difficult for people with little knowledge of data mining to choose a suitable one. This paper analyzes clustering techniques on a healthcare dataset in order to determine which algorithms produce the best-optimized clusters. The performance of two clustering algorithms (K-means and DBSCAN) was compared using silhouette score values. First, we analyzed the K-means algorithm using different numbers of clusters (K) and different distance metrics. Second, we analyzed the DBSCAN algorithm using different minimum numbers of points required to form a cluster (minPts) and different distance metrics. The experimental results indicate that both K-means and DBSCAN produce clusters with strong intra-cluster cohesion and inter-cluster separation. Based on the analysis, K-means performed better than DBSCAN in terms of clustering accuracy and execution time.
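The silhouette score used for the comparison can be computed directly from its definition. The plain-Python sketch below uses an invented 2-D toy set; real healthcare features would take the place of the points, and libraries such as scikit-learn provide this metric ready-made.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient: per point, (b - a) / max(a, b), where a is
    the mean intra-cluster distance and b the mean distance to the nearest
    other cluster. Values near 1 indicate cohesive, well-separated clusters."""
    def mean_dist(p, group):
        return sum(math.dist(p, q) for q in group) / len(group)
    score = 0.0
    for p, l in zip(points, labels):
        own = [q for q, m in zip(points, labels) if m == l and q != p]
        a = mean_dist(p, own) if own else 0.0
        b = min(mean_dist(p, [q for q, m in zip(points, labels) if m == other])
                for other in set(labels) if other != l)
        score += (b - a) / max(a, b)
    return score / len(points)

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = silhouette(pts, [0, 0, 0, 1, 1, 1])   # tight, well-separated clustering
bad = silhouette(pts, [0, 1, 0, 1, 0, 1])    # arbitrary split of the same points
```

Comparing `good` against `bad` mirrors how the paper ranks clustering configurations: the assignment with the higher mean silhouette is preferred.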
A Framework for Statistical Simulation of Physiological Responses (SSPR) (Waqas Tariq)
The problem of selecting, from a large number of variables, those that predict certain important dependent variables has long interested both applied statisticians and researchers in applied physiology, and various statistical techniques have been developed for this purpose. This framework embeds various statistical techniques for sampling and resampling and supports statistical simulation of physiological responses under different environmental conditions. Population generation and other statistical calculations are based on user-supplied inputs: a mean vector, a covariance matrix, and the data. The framework works on original data as well as on simulated data generated by the software. Approach: the mean vector and covariance matrix are sufficient statistics when the underlying distribution is multivariate normal; the framework uses these two inputs to generate a simulated multivariate normal population for any number of variables. The software replaces manual operation with a computer-based system, providing efficiency, accuracy, timeliness, and economy. Result: a complete framework that can statistically simulate any type and any number of responses or variables. If the simulated data are analyzed using statistical techniques, the results will match those obtained from the original data; the system also helps when data are missing for some variables. Conclusion: the proposed system makes it possible to carry out physiological studies and statistical calculations even when the actual data are not available.
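The core generation step the abstract describes, simulating multivariate normal data from a user-supplied mean vector and covariance matrix, can be sketched in plain Python via a Cholesky factorization. This is a minimal stand-in for the framework's generator; the mean and covariance below are invented, and the covariance must be positive definite.

```python
import math
import random

def cholesky(cov):
    """Lower-triangular L with L @ L^T = cov (cov must be positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(cov[i][i] - s) if i == j else (cov[i][j] - s) / L[j][j]
    return L

def simulate(mean, cov, n, seed=0):
    """Draw n samples from N(mean, cov) by transforming independent normals."""
    rng = random.Random(seed)
    L = cholesky(cov)
    d = len(mean)
    out = []
    for _ in range(n):
        z = [rng.gauss(0, 1) for _ in range(d)]
        out.append([mean[i] + sum(L[i][k] * z[k] for k in range(d))
                    for i in range(d)])
    return out

# Invented physiological example: two correlated responses.
samples = simulate(mean=[10.0, 5.0], cov=[[4.0, 1.0], [1.0, 2.0]], n=5000)
```

Analyzing `samples` recovers the supplied moments, which is exactly the property the abstract claims: analysis of the simulated population matches analysis of the original data.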
Selection of Articles using Data Analytics for Behavioral Dissertation Resear... (PhD Assistance)
Digital interventions aim to sustain positive change in health-related outcomes, including psychological, educational, behavioral, environmental, and social outcomes. These interventions may be delivered through any digital device, such as a phone or computer. Testing a digital intervention yields complex, large-scale datasets of usage data which, if analyzed properly, provide invaluable detail about how users interact with the intervention and inform our understanding of engagement. This paper recommends an innovative framework for analyzing usage data associated with a digital intervention.
An approach for discrimination prevention in data mining (eSAT Journals)
Abstract: In the age of database technologies, large amounts of data are collected and analyzed using data mining techniques. The main issues in data mining, however, are potential privacy invasion and potential discrimination. Classification is one of the techniques used in data mining for decision making, and if the dataset is biased, discriminatory decisions may result. In this paper we therefore review recent state-of-the-art anti-discrimination techniques, focusing on discrimination discovery and prevention in data mining, and we also study a theoretical proposal for enhancing data quality. Keywords: anti-discrimination, data mining, direct and indirect discrimination prevention, rule protection, rule generalization, privacy.
GA-CFS APPROACH TO INCREASE THE ACCURACY OF ESTIMATES IN ELECTIONS PARTICIPATION (IJFCST Journal)
Predicting the main indexes of participation in elections and their influencing factors is a serious challenge for political and social planners. Given the limitations of the analyses offered by political scientists, data mining researchers have tried to overcome the problems of other methods by discovering hidden knowledge in the data. In this paper, we present a combined method of a Genetic Algorithm (GA) and Correlation-based Feature Selection (CFS) that identifies and removes noisy features from the full feature set, increasing the precision of data-mining-based classification of election participation. Our results indicate that the proposed method can improve prediction precision over other methods.
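The CFS half of the approach follows directly from its standard merit formula, which rewards high feature-class correlation and penalizes feature-feature redundancy. In the plain-Python sketch below, the GA search is replaced by exhaustive enumeration over a tiny invented feature set, so this illustrates only the scoring, not the paper's method.

```python
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cfs_merit(features, target, subset):
    """CFS merit = k*r_cf / sqrt(k + k(k-1)*r_ff): high mean feature-class
    correlation (r_cf), low mean feature-feature correlation (r_ff)."""
    k = len(subset)
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    if k == 1:
        return r_cf
    pairs = list(combinations(subset, 2))
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data: f1 tracks the target, f2 is a redundant copy of f1, f3 is noise.
target = [0, 1, 0, 1, 1, 0, 1, 0]
features = {"f1": [0.1, 0.9, 0.2, 0.8, 1.0, 0.0, 0.9, 0.1],
            "f2": [0.1, 0.9, 0.2, 0.8, 1.0, 0.0, 0.9, 0.1],
            "f3": [0.6, 0.6, 0.4, 0.4, 0.6, 0.6, 0.4, 0.4]}
best = max((s for r in (1, 2, 3) for s in combinations(features, r)),
           key=lambda s: cfs_merit(features, target, s))
```

The noisy feature f3 is excluded from the best subset, which is the behaviour the paper relies on to remove noisy features before classification.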
Air quality forecasting using convolutional neural networks (BIJIAM Journal)
Air pollution is now one of the biggest environmental risks, causing more than 6 million premature deaths each year from heart disease, stroke, diabetes, respiratory disease, and other conditions. Protecting humans from the damage caused by air pollution is a major issue for the global community. Air pollution can be predicted with machine learning (ML) algorithms, which combine statistics and computer science to maximize predictive power; ML can also be used to predict the air quality index (AQI). The aim of this research is to develop a convolutional neural network (CNN) model to predict air quality from an unseen dataset that includes concentrations of nitrogen dioxide (NO2), carbon monoxide (CO), and sulfur dioxide (SO2). The proposed system is implemented in two steps: the first focuses on data analysis and pre-processing, including filtering, feature extraction, constructing the convolutional layers, and optimizing the parameters of each layer, while the second evaluates model accuracy. The developed CNN model outputs the predicted AQI, achieving a root-mean-square error of 13.4150 and a high accuracy of 86.585%. The overall model is implemented in MATLAB.
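The convolutional layer at the heart of such a model can be illustrated with a minimal 1-D forward pass in plain Python. The paper's actual model is built in MATLAB and its learned weights are not available, so the NO2 readings and the kernel below are invented; the kernel here acts as a simple rising-trend detector.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution (cross-correlation, as used in CNN layers)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def relu(xs):
    # Non-linearity applied after the convolution, as in a CNN layer.
    return [max(0.0, x) for x in xs]

def mean_pool(xs):
    # Global pooling collapses the feature map to one value.
    return sum(xs) / len(xs)

# Hypothetical hourly NO2 readings; the kernel is a stand-in for learned weights.
no2 = [20.0, 22.0, 35.0, 60.0, 58.0, 30.0, 25.0, 21.0]
features = relu(conv1d(no2, kernel=[-1.0, 0.0, 1.0]))
aqi_feature = mean_pool(features)
```

In a trained network, many such kernels feed fully connected layers that regress the final AQI value; this sketch shows only the convolution, activation, and pooling steps.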
Prevention of fire and hunting in forests using machine learning for sustaina... (BIJIAM Journal)
Deforestation, illegal hunting, and forest fires are current issues that affect the diversity and ecosystem of forests. To increase the biodiversity of species and ecosystems, it is imperative to preserve the forest. The conventional techniques employed to prevent these issues are costly, less effective, and insecure, and current systems are unreliable and consume more energy. By utilizing an Internet of Things (IoT) system, this technology offers a more practical and economical method of continuously maintaining and monitoring the status of the forest. To guarantee strong security, the system combines a number of sensors, alarms, cameras, lights, and microphones, helping to reduce forest loss, animal trafficking, and forest fires. In the proposed system, sensors are used for monitoring and cloud storage is used for data storage. Through machine learning, the Raspberry Pi camera module significantly aids in preventing unlawful wildlife hunting and in detecting and preventing forest fires.
Object detection and ship classification using YOLO and Amazon RekognitionBIJIAM Journal
The need for reliable automated object detection and classification in the maritime sphere has increasedsignificantly over the last decade. The increased use of unmanned marine vehicles necessitates the developmentand use of an autonomous object detection and classification systems that utilize onboard cameras and operatereliably, efficiently, and accurately without the need for human intervention. As the autonomous vehicle realmmoves further away from active human supervision, the requirements for a detection and classification suite callfor a higher level of accuracy in the classification models. This paper presents a comparative study using differentclassification models on a large maritime vessel image dataset. For this study, additional images were annotatedusing Roboflow, focusing on the types of subclasses that had the lowest detection performance, a previous workby Brown et al. (1). The present study uses a dataset of over 5,000 images. Using the enhanced set of imageannotations, models were created in the cloud using Google Colab using YOLOv5, YOLOv7, and YOLOv8 as wellas in Amazon Web Services using Amazon Rekognition. The models were tuned and run for five runs of 150 epochseach to collect efficiency and performance metrics. Comparison of these metrics from the YOLO models yieldedinteresting improvements in classification accuracies for the subclasses that previously had lower scores, but atthe expense of the overall model accuracy. Furthermore, training time efficiency was improved drastically using thenewer YOLO APIs. In contrast, using Amazon Rekognition yielded superior accuracy results across the board, butat the expense of the lowest training efficiency.
Machine learning for autonomous online exam frauddetection: A concept designBIJIAM Journal
E-learning (EL) has emerged as one of the most valuable means for continuing education around the world,especially in the aftermath of the global pandemic that included a variety of obstacles. Real-time onlineassessments have become a significant concern for educational organizations. Instances of fraudulent behaviorduring online exams (OEs) have created considerable challenges for exam invigilators, who are unable to identifyand remove such dishonest behavior. In response to this significant issue, educational institutions have used avariety of manual procedures to alleviate the situation, but none of these measures have shown to be particularlyinnovative or effective. The current study presents a novel strategy for detecting fraudulent actions in real timeduring OEs that uses convolutional neural network (CNN) algorithms and image processing. The developmentmodel will be trained using the CK and CK++ datasets. The training procedure will use 80% of the selecteddataset, with the remaining 20% used for model testing to confirm the model’s efficacy and generalization capacity.This project intends to revolutionize the monitoring and prevention of fraudulent actions during online tests byintegrating CNN techniques and image processing. The use of CK and CK++ datasets, as well as an 80–20split for training and testing, contributes to the study’s thorough and rigorous approach. Educational institutionscan improve their assessment procedures and maintain the credibility of EL as a credible and equitable way ofcontinuing education by successfully using this unique technique.
Prognostication of the placement of students applying machine learning algori...BIJIAM Journal
Placement is the process of connecting the selected candidate with the employer. Every student might have adream of having a job offer when he or she is about to complete her course. All educational institutions aim athaving their students well placed in good organizations. The reputation of any institution depends on the placementof its students. Hence, many institutions try hard to have a good placement cell. Classification using machinelearning may be utilized to retrieve data from the student-databases. A prediction model that can foretell theeligibility of the students based on their academic and extracurricular achievements is proposed. Related data wascollected from many institutions for which the placement-prediction is made. This paradigm is being weighed upwith the existing algorithms, and findings have been made regarding the accuracy of predictions. It was found thatthe proposed algorithm performed significantly better and yielded good results.
Drug recommendation using recurrent neural networks augmented with cellular a...BIJIAM Journal
Drug recommendation systems are systems that have the capability to recommend drugs. On a daily basis, a huge amount of data is being generated by the patients. All this valuable data can be properly utilized to create a reliable drug recommendation system. In this paper, we recommend a system for drug recommendations. The main scope of our system is to predict the correct medication based on reviews and ratings. Our proposed system uses natural language processing techniques (NLP), recurrent neural networks (RNN), and cellular automata (CA). We also considered various metrics like precision, recall, accuracy, F1 score, and ROC curve as measures of our system’s performance. NLP techniques are being used for gathering useful information from patient data, and RNN is a machine learning methodology that works really well in analyzing textual data. The system considers various patient data attributes like age, gender, dosage, medical history, and symptoms in order to make appropriate predictions. The proposed system has the potential to help medical professionals make informed drug recommendations.
Emotion Recognition Based on Speech Signals by Combining Empirical Mode Decom...BIJIAM Journal
This paper proposes a novel method for speech emotion recognition. Empirical mode decomposition (EMD) is applied in this paper for the extraction of emotional features from speeches, and a deep neural network (DNN) is used to classify speech emotions. This paper enhances the emotional components in speech signals by using EMD with acoustic feature Mel-Scale Frequency Cepstral Coefficients (MFCCs) to improve the recognition rates of emotions from speeches using the classifier DNN. In this paper, EMD is first used to decompose the speech signals, which contain emotional components into multiple intrinsic mode functions (IMFs), and then emotional features are derived from the IMFs and are calculated using MFCC. Then, the emotional features are used to train the DNN model. Finally, a trained model that could recognize the emotional signals is then used to identify emotions in speeches. Experimental results reveal that the proposed method is effective.
Enhanced 3D Brain Tumor Segmentation Using Assorted Precision TrainingBIJIAM Journal
A brain tumor is a medical disorder faced by individuals of all demographics. Medically, it is described as the spread of nonessential cells close to or throughout the brain. Symptoms of this ailment include headaches, seizures, and sensory changes. This research explores two main categories of brain tumors: benign and malignant. Benign spreads steadily, and malignant express growth makes it dangerous. Early identification of brain tumors is a crucial factor for the survival of patients. This research provides a state-of-the-art approach to the early identification of tumors within the brain. We implemented the SegResNet architecture, a widely adopted architecture for threedimensional segmentation, and trained it using the automatic multi-precision method. We incorporated the dice loss function and dice metric for evaluating the model. We got a dice score of 0.84. For the tumor core, we got a dice score of 0.84; for the whole tumor, 0.90; and for the enhanced tumor, we got a score of 0.79.
Hybrid Facial Expression Recognition (FER2013) Model for Real-Time Emotion Cl...BIJIAM Journal
Facial expression recognition is a vital research topic in most fields ranging from artificial intelligence and gaming to human–computer interaction (HCI) and psychology. This paper proposes a hybrid model for facial expression recognition, which comprises a deep convolutional neural network (DCNN) and a Haar Cascade deep learning architecture. The objective is to classify real-time and digital facial images into one of the seven facial emotion categories considered. The DCNN employed in this research has more convolutional layers, ReLU activation functions, and multiple kernels to enhance filtering depth and facial feature extraction. In addition, a Haar Cascade model was also mutually used to detect facial features in real-time images and video frames. Grayscale images from the Kaggle repository (FER-2013) and then exploited graphics processing unit (GPU) computation to expedite the training and validation process. Pre-processing and data augmentation techniques are applied to improve training efficiency and classification performance. The experimental results show a significantly improved classification performance compared to state-of-the-art (SoTA) experiments and research. Also, compared to other conventional models, this paper validates that the proposed architecture is superior in classification performance with an improvement of up to 6%, totaling up to 70% accuracy, and with less execution time of 2,098.8 s.
Role of Artificial Intelligence in Supply Chain ManagementBIJIAM Journal
The term “supply chain” refers to a network of facilities that includes a variety of companies. To minimise the entire cost of the supply chain, these entities must collaborate. This research focuses on the use of Artificial Intelligence techniques in supply chain management. It includes supply chain management examples like as
demand forecasting, supply forecasting, text analytics, pricing panning, and more to help companies improve their processes, lower costs and risk, and boost revenue. It gives us a quick rundown of all the key principles of economics and how to comprehend and use them effectively.
An Internet – of – Things Enabled Smart Fire Fighting SystemBIJIAM Journal
Fire accident is a mishap that could be either man made or natural. Fire accident occurs frequently and can be controlled but may at time result in severe loss of life and property. Many a times fire fighters struggle to sort out the exact source of fire as it is continuously flammable and it spreads all over the area. On this concern, we have designed an Internet – of – Things (IoT) based device which sorts out the exact source of fire through a software and
hardware devices and also the complete detail of an area can be visualized by fire fighters which are preinstalled in the software itself. Because of our system’s intelligence in decision making during fire fighting, the proposed system is claimed as ‘Smart Fire Fighting System’. The implementation of the smart fire fighting system can make
the fire fighters to analyze the current situation immediately and make the decision quicker in an effective manner. Because of this system, fire losses in a building can be greatly reduced and many lives can be rescued immediately and moreover fire spread can also be restricted.
Evaluate Sustainability of Using Autonomous Vehicles for the Last-mile Delive...BIJIAM Journal
This study aims to determine if autonomous vehicles (AV) for last-mile deliveries are sustainable from three perspectives: social, environmental, and economic. Because of the relevance of AV application for last-mile delivery, safety was only addressed for the sake of societal sustainability. According to this study, it is rather safe to deploy AVs for delivery since current society’s speed restriction and decent road conditions give the foundation for AVs to run safely for last-mile delivery in metropolitan areas. Furthermore, AV has a distinct advantage in the event of a pandemic. The biggest worry for environmental sustainability is the emission problem. In terms of a series of emissions, it is established that AV has a substantial benefit in terms of emission reduction. This is mostly due to the differences in driving behaviour between autonomous cars and human vehicles. Because of AV’s commercial character, this research used a quantitative approach to highlight the economic sustainability of AV. The
study demonstrates the economic benefit of AV for various carrying capacities.
Automation Attendance Systems Approaches: A Practical ReviewBIJIAM Journal
Accounting for people is the first step of every manpower-based organization in today’s world. Hence, it takes up a signification amount of energy and value in the form of money from respective organizations for both
implementing a suitable system for manpower management as well as maintaining that same system. Although this amount of expenditure for big organizations is near to nothing, rather just a formality, it does not hold as much truth for small organizations such as schools, colleges, and even universities to a certain degree. This is the first point. The second point for discussion is that much work has been done to solve this issue. Various technologies like Biometrics, RFID, Bluetooth, GPS, QR Code, etc., have been used to tackle the issues of attendance collection. This study paves
the path for researchers by reviewing practical methods and technologies used for existing attendance systems.
Transforming African Education Systems through the Application of IOTBIJIAM Journal
The project sought to provide a paradigm for improving African education systems through the use of the IOT. The created IOT model for Africa will enable African countries, notably Namibia, to exchange educational content and resources with other African countries. The objective behind the IOT paradigm in Africa’s education sectors is to provide open access to knowledge and information. The study revealed that there are no recognised platforms in African education systems that are utilised by African governments to interact, communicate, and share educational material directly with African institutions. As a result, the current research developed a model for
transforming African education systems using the IOT in the Namibian context, which will serve as a centralised online platform for self-study, new skill acquisition, and self-improvement using materials provided by African institutions of higher learning. Everyone is welcome to use the platform, including students, instructors, and
members of the general public.
Deep Learning Analysis for Estimating Sleep Syndrome Detection Utilizing the ...BIJIAM Journal
Manual sleep stage scoring is frequently performed by sleep specialists by visually evaluating the patient’s neurophysiological signals acquired in sleep laboratories. This is a difficult, time-consuming, & laborious process. Because of the limits of human sleep stage scoring, there is a greater need for creating Automatic Sleep Stage Classification (ASSC) systems. Sleep stage categorization is the process of distinguishing the distinct stages of sleep & is an important step in assisting physicians in the diagnosis & treatment of associated sleep disorders. In this research, we offer a unique method & a practical strategy to predicting early onsets of sleep disorders, such as restless leg syndrome & insomnia, using the Twin Convolutional Model FTC2, based on an algorithm composed of two modules. To provide localised time-frequency information, 30 second long epochs of EEG recordings are subjected to a Fast Fourier Transform, & a deep convolutional LSTM neural network is trained for sleep stage categorization. Automating sleep stages detection from EEG data offers a great potential to tackling sleep irregularities on a
daily basis. Thereby, a novel approach for sleep stage classification is proposed which combines the best of signal
processing & statistics. In this study, we used the PhysioNet Sleep European Data Format (EDF) Database. The code
evaluation showed impressive results, reaching accuracy of 90.43, precision of 77.76, recall of 93,32, F1-score of 89.12 with the final mean false error loss 0.09. All the source code is availlable at https://github.com/timothy102/eeg.
Robotic Mushroom Harvesting by Employing Probabilistic Road Map and Inverse K...BIJIAM Journal
A collision-free path to a destination position in a random farm is determined using a probabilistic roadmap (PRM) that can manage static and dynamic obstacles. The position of ripening mushrooms is a result of picture processing. A mushroom harvesting robot is explored that uses inverse kinematics (IK) at the target
position to compute the state of a robotic hand for grasping a ripening mushroom and plucking it. The DenavitHartenberg approach was used to create a kinematic model of a two-finger dexterous hand with three degrees of freedom for mushroom picking. Unlike prior experiments in mushroom harvesting, mushrooms are not planted in a grid or design, but are randomly scattered. At any point throughout the harvesting process, no human interaction is necessary
Literature Review Basics and Understanding Reference Management.pptxDr Ramhari Poudyal
Three-day training on academic research focuses on analytical tools at United Technical College, supported by the University Grant Commission, Nepal. 24-26 May 2024
Understanding Inductive Bias in Machine LearningSUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Welcome to WIPAC Monthly the magazine brought to you by the LinkedIn Group Water Industry Process Automation & Control.
In this month's edition, along with this month's industry news to celebrate the 13 years since the group was created we have articles including
A case study of the used of Advanced Process Control at the Wastewater Treatment works at Lleida in Spain
A look back on an article on smart wastewater networks in order to see how the industry has measured up in the interim around the adoption of Digital Transformation in the Water Industry.
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Dr.Costas Sachpazis
Terzaghi's soil bearing capacity theory, developed by Karl Terzaghi, is a fundamental principle in geotechnical engineering used to determine the bearing capacity of shallow foundations. This theory provides a method to calculate the ultimate bearing capacity of soil, which is the maximum load per unit area that the soil can support without undergoing shear failure. The Calculation HTML Code included.
Online aptitude test management system project report.pdfKamal Acharya
The purpose of on-line aptitude test system is to take online test in an efficient manner and no time wasting for checking the paper. The main objective of on-line aptitude test system is to efficiently evaluate the candidate thoroughly through a fully automated system that not only saves lot of time but also gives fast results. For students they give papers according to their convenience and time and there is no need of using extra thing like paper, pen etc. This can be used in educational institutions as well as in corporate world. Can be used anywhere any time as it is a web based application (user Location doesn’t matter). No restriction that examiner has to be present when the candidate takes the test.
Every time when lecturers/professors need to conduct examinations they have to sit down think about the questions and then create a whole new set of questions for each and every exam. In some cases the professor may want to give an open book online exam that is the student can take the exam any time anywhere, but the student might have to answer the questions in a limited time period. The professor may want to change the sequence of questions for every student. The problem that a student has is whenever a date for the exam is declared the student has to take it and there is no way he can take it at some other time. This project will create an interface for the examiner to create and store questions in a repository. It will also create an interface for the student to take examinations at his convenience and the questions and/or exams may be timed. Thereby creating an application which can be used by examiners and examinee’s simultaneously.
Examination System is very useful for Teachers/Professors. As in the teaching profession, you are responsible for writing question papers. In the conventional method, you write the question paper on paper, keep question papers separate from answers and all this information you have to keep in a locker to avoid unauthorized access. Using the Examination System you can create a question paper and everything will be written to a single exam file in encrypted format. You can set the General and Administrator password to avoid unauthorized access to your question paper. Every time you start the examination, the program shuffles all the questions and selects them randomly from the database, which reduces the chances of memorizing the questions.
We have compiled the most important slides from each speaker's presentation. This year’s compilation, available for free, captures the key insights and contributions shared during the DfMAy 2024 conference.
Hierarchical Digital Twin of a Naval Power SystemKerry Sado
A hierarchical digital twin of a Naval DC power system has been developed and experimentally verified. Similar to other state-of-the-art digital twins, this technology creates a digital replica of the physical system executed in real-time or faster, which can modify hardware controls. However, its advantage stems from distributing computational efforts by utilizing a hierarchical structure composed of lower-level digital twin blocks and a higher-level system digital twin. Each digital twin block is associated with a physical subsystem of the hardware and communicates with a singular system digital twin, which creates a system-level response. By extracting information from each level of the hierarchy, power system controls of the hardware were reconfigured autonomously. This hierarchical digital twin development offers several advantages over other digital twins, particularly in the field of naval power systems. The hierarchical structure allows for greater computational efficiency and scalability while the ability to autonomously reconfigure hardware controls offers increased flexibility and responsiveness. The hierarchical decomposition and models utilized were well aligned with the physical twin, as indicated by the maximum deviations between the developed digital twin hierarchy and the hardware.
Discrete clusters formulation through the exploitation of optimized k-modes algorithm for hypotheses validation in social work research: the case of greek social workers working with refugees
BOHR International Journal of Internet of Things,
Artificial Intelligence and Machine Learning
2023, Vol. 2, No. 1, pp. 11–18
DOI: 10.54646/bijiam.2023.12
www.bohrpub.com
METHODS
Discrete clusters formulation through the exploitation
of optimized k-modes algorithm for hypotheses validation
in social work research: the case of greek social workers
working with refugees
Alexis Lazanas1*, Ilias Siachos1, Dimitra-Dora Teloni2, Sofia Dedotsi2 and Aristeidis G. Telonis3
1Department of Mechanical Engineering and Aeronautics, University of Patras, Rion-Patras, Greece
2Department of Social Work, University of West Attica, Athens, Greece
3Department of Human Genetics, Miller School of Medicine, University of Miami, Coral Gables, FL, United States
*Correspondence:
Alexis Lazanas,
alexlas@upatras.gr
Received: 21 January 2023; Accepted: 07 February 2023; Published: 18 February 2023
This article focuses on the results of a self-funded quantitative research project on social workers working in
the “refugee” crisis and in social services in Greece (1). The research, among other findings, argues that front-line
professionals possess specific characteristics regarding their working profile. The statistical methods in that
research relied on significance tests to validate the initial hypotheses concerning correlations between dataset
variables. In contrast to this approach, in this work we present an alternative way of validating initial hypotheses
through the exploitation of clustering algorithms. Toward that goal, we evaluated several frequently used clustering
algorithms regarding their efficiency in feature selection processes, and we finally propose a modified k-Modes
algorithm for efficient feature subset selection.
Keywords: clustering, k-Modes algorithm, social work, machine learning, refugee crisis
1. Introduction
Social workers have been on the “front lines” (2) of
the so-called refugee “crisis,” facing a series of difficulties
in effectively helping their users. Although Greece is one of
the “entrance” countries to Europe, there has been no
current research on social work practice with refugees. The
study was a self-funded, quantitative research project whose
main research questions concerned, among others, the
exploration of front-line professionals’ profiles. Detailed
information about the research, concerning (i) aims and
hypotheses, (ii) sampling strategies and research ethics, (iii)
statistical methods and analysis, and (iv) the research results,
can be retrieved from (1, 3).
In this article, we present a model-based approach
in order to provide an alternative validation to the
globally acknowledged hypothesis-testing method.
More specifically, hypothesis-testing cases are
commonly handled through the application of either
the chi-square test (4) or the hypergeometric test (5) in order to
determine statistical significance. This procedure is typical
for discovering “correlation” between independent variables
in datasets with categorical values. Augmenting this typical
approach, we propose a new formally structured model
that uses clustering techniques to formulate
data clusters capable of validating the initially composed
statistical hypotheses.
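For categorical survey variables, such a significance test amounts to comparing observed and expected counts in a contingency table. The following minimal, self-contained sketch illustrates this; the counts are invented for illustration, and for a 2×2 table (one degree of freedom) the p-value can be written with the error function, so no external statistics library is needed.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 contingency table.

    Returns (statistic, p_value). With one degree of freedom the
    chi-square CDF is P(X <= x) = erf(sqrt(x / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = [a + b, c + d], [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            observed = table[i][j]
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    p_value = 1.0 - math.erf(math.sqrt(stat / 2.0))
    return stat, p_value

# Invented counts: rows = two working profiles, columns = yes/no answers.
stat, p = chi_square_2x2([[30, 10], [10, 30]])  # p < 0.05: reject independence
```

For larger tables the same observed-versus-expected loop applies, with the p-value taken from a chi-square distribution with (rows − 1)(columns − 1) degrees of freedom.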
To accomplish the above goal, we adopted a rather
“typical” technique for the feature selection process
on our model’s data. A candidate feature subset is
created each time in order to categorize data points into
intuitively similar but not predefined user groups. The
clustering phase is then implemented through the evaluation of
the following clustering algorithms: (1) a modified k-Modes
algorithm (6), (2) agglomerative clustering (7), and (3) the
standard k-Modes algorithm (8), in order to adopt the one
most efficient for feature selection.
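The standard k-Modes algorithm (8) replaces k-Means’ Euclidean distance and mean update with Hamming dissimilarity and an attribute-wise mode update, which makes it suitable for categorical questionnaire data. A minimal sketch follows; the records are invented stand-ins for questionnaire answers, the naive “first k records” initialization is a simplification (practical implementations use Huang or Cao initialization), and the paper’s modifications to k-Modes are not reproduced here.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

def k_modes(records, k, iters=20):
    """Plain k-Modes: assign each record to its nearest mode under Hamming
    dissimilarity, then recompute each mode attribute-wise, until stable."""
    modes = records[:k]  # naive initialization: first k records
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for rec in records:
            j = min(range(k), key=lambda i: hamming(rec, modes[i]))
            clusters[j].append(rec)
        new_modes = []
        for j, members in enumerate(clusters):
            if not members:            # keep the old mode if a cluster empties
                new_modes.append(modes[j])
                continue
            new_modes.append(tuple(
                Counter(column).most_common(1)[0][0] for column in zip(*members)))
        if new_modes == modes:         # converged
            break
        modes = new_modes
    return modes, clusters

# Invented records: (front-line?, employer type, work setting).
records = [
    ("yes", "public", "urban"),
    ("no", "ngo", "camp"),
    ("yes", "public", "urban"),
    ("yes", "ngo", "urban"),
    ("no", "public", "camp"),
    ("no", "ngo", "camp"),
]
modes, clusters = k_modes(records, k=2)
```

On this toy input the algorithm converges to two modes, one per group of similar records, and the cluster memberships can then be inspected against the hypothesized profiles.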
The remainder of the article is organized as follows:
Section “2. Related work” discusses related work
considered in the context of the literature; Section “3. Social
workers’ profile: Validating our initial hypothesis” presents
the statistical hypothesis testing that preceded our approach;
our approach is described in Section “4. Our approach”,
which demonstrates its applicability through analytical
steps; and finally, concluding remarks and future work
directions are outlined in Section “5. Conclusion”.
2. Related work
Clustering and hypothesis testing are two powerful
techniques used in data analysis (9). Clustering is the
process of grouping similar objects together based on
their characteristics, while hypothesis testing is a statistical
technique used to test the validity of a hypothesis. In recent
years, researchers have combined these two techniques to
gain deeper insights into their data. In this review, we will
discuss the combination of clustering and hypothesis testing
and its applications in different fields (10–13).
One of the primary applications of clustering and
hypothesis testing is in biology, where interesting reviews can
be found in Jacques and Preda (14) and Wang et al. (15–17),
as well as in work on medical imaging data (18, 19). Researchers in
this field use these techniques to identify groups of genes
that are related to a particular disease (20); similar approaches
are used in chemometrics (21, 22). Clustering is used to group genes that have similar
characteristics (23), such as gene expression levels (24), while
hypothesis testing is used to test whether the genes in each
cluster are associated with the disease (25). By combining
these techniques (26), researchers can identify clusters of
genes that are statistically significant and associated with the
disease (27).
Another area where the combination of clustering and
hypothesis testing is used is in finance. Researchers in
this field use clustering to group stocks that have similar
characteristics (28, 29), such as market capitalization, price-
to-earnings ratio, and dividend yield. Hypothesis testing
is used to test whether the stocks in each cluster have
significantly different returns. This can help investors identify
stocks that are undervalued or overvalued, and make more
informed investment decisions.
In the field of marketing, clustering methods (30, 31)
and hypothesis testing are used to segment customers into
different groups based on their characteristics (32) and test
whether these groups have different purchasing behaviors.
For example, a company may use clustering to group
customers based on their age, income, and purchasing history
(10). Hypothesis testing can then be used to test whether
these groups have different purchasing behaviors, such as
buying more frequently or spending more money. This can
help companies develop more targeted marketing strategies
and improve their overall sales.
In the field of image processing, clustering and hypothesis
testing are used to segment images into different regions
based on their characteristics and test whether these regions
have different properties. For example, a researcher may use
clustering to segment an image into regions based on color
or texture (33, 34). Hypothesis testing can then be used to
test whether these regions have different properties, such
as brightness or contrast. This can help researchers better
understand the properties of the image and develop more
advanced image processing algorithms.
One of the primary advantages of the combination of
clustering and hypothesis testing is that it allows researchers
to identify statistically significant groups of data that may
not be apparent using either technique alone. Clustering
can help identify groups of data that are similar (35), while
hypothesis testing can help determine whether these groups
are statistically significant. By combining these techniques,
researchers can gain a deeper understanding of their data and
develop more accurate models (36).
In conclusion, the combination of clustering (37) and
hypothesis testing is a powerful technique that has numerous
applications in different fields (14, 38, 39). By using clustering
to group similar data and hypothesis testing to test whether
these groups are statistically significant, researchers can gain
deeper insights into their data and develop more accurate
models (40, 41). This technique has been used successfully
in biology, finance, marketing, and image processing, among
other fields (42, 43), and is likely to continue to be an
important tool in data analysis in the future.
3. Social workers’ profile: validating our initial hypothesis
A self-completed, anonymous electronic questionnaire was
available online from June until September 2018, containing
52 questions (1, 3) designed by the researchers. Out of a
total of 158 responses, 21 were incomplete and were thus
excluded. We analyzed the 137 complete responses, and
statistical significance was evaluated in R version 3.4.3, and
the results’ graphs were created using Excel 365 Pro Plus.
We employed the hypergeometric or chi-squared test, and
the statistical threshold was set at a P-value of 0.05. In this
section, the findings of our research concerning the profile
of social workers in the field of social services in Greece
are presented in order to gradually construct our main
hypothesis. To briefly describe the above concept, we note
that the discussion in Authors’ own (1) “provided important
insights on the challenges and difficulties that social work
professionals face in helping effectively their users.” One of
the main findings of this research was that social workers
3. 10.54646/bijiam.2023.12 13
working in the refugee “crisis” are young graduates with
limited work experience. We define our main hypothesis
(H0) as follows:
H0 = “social workers are young graduates with limited work
experience.”
As shown in Figure 1, regarding the social workers’ profile,
the vast majority (80%) are women and 18.3% are men. In
total, 52% are between 22 and 30 years old, while 39% are
between 31 and 39 years old. Clearly, a large majority of the
professionals working on the “front lines” with refugees are
under middle age, and consequently the first part of H0 is
considered validated.
Apart from the fact that the profession of social work is
considered “female dominated,” as presented in Figure 1,
another critical point arising from the research concerns
social workers’ overall work experience. More specifically, as
presented in Figure 2, an impressively wide share (almost
90%) of the professionals working on the “front line” have
limited working experience, in the interval between 0 and
4 years, within a statistically significant threshold (P-value
< 0.05; hypergeometric test).
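For illustration, such a one-sided hypergeometric test can be computed with the Python standard library alone. The population and success counts below are hypothetical, not the study’s raw figures:

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items without replacement from a population
    of N that contains K 'successes' (one-sided hypergeometric test)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# Hypothetical figures (NOT the study's counts, which are not reported
# here): a population of 500 social workers, 200 of them with under 4 years
# of experience; a sample of 137 respondents contains 120 such workers.
p_value = hypergeom_sf(120, 500, 200, 137)
print(f"significant at 0.05: {p_value < 0.05}")
```

With the sample proportion so far above the population proportion, the resulting P-value is far below the 0.05 threshold.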
Moreover, the data analysis allowed us to capture the
distribution of social workers’ experience in the past (total
experience) and compare it with that at the time of the
research (hereafter called current experience). As depicted in
Figure 3, responses concerning total and current experience
were relatively similar at a statistically significant level
(P-value < 0.05; hypergeometric test). To further describe this
issue, we note that the polynomial trendlines and R² values
for total and current experience follow an identical prediction
pattern. The “current experience” trendline begins to gain
value after 3–4 years of experience. If we set “3–4 years” as the
trendlines’ curve alternation milestone, then we can
substantially argue that social workers tend to remain in the
same working position for less than the 4–5 year milestone.
This observation can be further explained but falls outside the
scope of this article.
To summarize the aforementioned findings, our initial
hypothesis (H0) has been validated through the comparison
of statistically significant observations, as thoroughly
described within this section. Having validated H0, in Section
“4. Our approach” we present our alternative approach to
testing H0 through a modified k-Modes algorithm.
4. Our approach
As stated in Section “3. Social workers’ profile: validating
our initial hypothesis,” our main hypothesis is that social
workers working on the front line of refugee and immigration
services are, in their majority, young graduates and/or
professionals with limited work experience (1). To validate
this hypothesis, in this section we propose a clustering
algorithm that categorizes data points
into intuitively similar—but not predefined—user groups.

FIGURE 1 | Respondents’ age and gender.

FIGURE 2 | Social workers’ experience in current working position.

FIGURE 3 | Comparing social workers’ total with current experience.
This process is particularly useful for summarizing the
data and understanding the basic features that differentiate
one user group (Cj) from another (44). In other words, our
basic goal is to distinguish a set of users who possess the
main characteristics of the attributes mentioned in Section
“3. Social workers’ profile: validating our initial hypothesis.”
4.1. Selecting input features for the clustering algorithm
Prior to clustering, our first task is to determine the
input features. This is important because many research
questions (features) present a highly imbalanced distribution
4. 14 Lazanas et al.
regarding their answers (e.g., 90% of the social workers
who participated in the survey hold a master’s degree).
Such features can create a common problem known
as overfitting (45) and can misleadingly be considered
the main features that define the differences in users’
categorization. Plenty of methods can be utilized to define
the optimal subset of features to be excluded from
the clustering algorithm. Additionally, one of our major
concerns during the feature selection process was the lack of
an obvious way to evaluate the efficiency of feature selection
without specific domain knowledge capable of guiding
the process.
An indirect method of performing feature selection is to
implement several scenarios with subsets of the available
features and validate the efficiency of the clustering process
for each scenario. This method may delay the extraction of
the desired features but guarantees high efficiency and
reliability. To this end, we implemented a greedy algorithm
that creates all possible subsets and evaluates the clustering
efficiency for each subset.
As shown in Figure 4, every time a subset is generated,
it undergoes the clustering process and is then
evaluated. We define the subset with the minimum
evaluation score as the optimal result of the “feature
selection” phase. To calculate the evaluation score,
a variety of metrics have been proposed in the literature,
with the “intracluster-to-intercluster distance ratio” (46)
suggested as the most reliable. The idea behind this
method is that the members of the same cluster should be
“closer” to each other than to the members of other clusters.
Consequently, the following metric is adopted, as
defined in Aggarwal (44):
$$\text{Intra} = \frac{1}{|P|} \sum_{(I_i, I_j) \in P} \text{dist}(I_i, I_j) \quad (1)$$

$$\text{Inter} = \frac{1}{|Q|} \sum_{(I_i, I_j) \in Q} \text{dist}(I_i, I_j) \quad (2)$$

$$\text{dist}(I_i, I_j) = \frac{1}{1 + \text{Sim}(I_i, I_j)} \quad (3)$$

where $\text{Sim}(I_i, I_j)$ represents the overall similarity between
two users, $P$ is the set of user pairs belonging to the same
cluster, and $Q$ is the set of user pairs belonging to different
clusters. An above-average clustering algorithm should
produce results with a relatively small intra/inter ratio.
Having calculated the ratio for every candidate feature subset,
we select the subset with the minimum value.
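A minimal sketch of this subset search follows, on toy data with hypothetical feature names and values. As a simplification, the cluster labelling is held fixed while subsets are scored; in the pipeline described above, each subset would first be re-clustered before its ratio is computed:

```python
import itertools

def dist(u, v, feats, freq):
    """dist = 1/(1 + Sim), with Sim the inverse-occurrence-frequency
    similarity over the selected features (Eqs. 3, 5, 6)."""
    sim = sum(1.0 / freq[f][u[f]] ** 2 for f in feats if u[f] == v[f])
    return 1.0 / (1.0 + sim)

def intra_inter_ratio(users, labels, feats, freq):
    """Mean same-cluster distance over mean cross-cluster distance (Eqs. 1, 2)."""
    intra, inter, n_p, n_q = 0.0, 0.0, 0, 0
    for (a, la), (b, lb) in itertools.combinations(zip(users, labels), 2):
        d = dist(a, b, feats, freq)
        if la == lb:
            intra, n_p = intra + d, n_p + 1
        else:
            inter, n_q = inter + d, n_q + 1
    return (intra / n_p) / (inter / n_q)

# Toy data: 4 users, 3 categorical features, and a given 2-cluster labelling.
users = [{"age": "22-30", "msc": "yes", "exp": "0-1"},
         {"age": "22-30", "msc": "yes", "exp": "2-3"},
         {"age": "40+",   "msc": "no",  "exp": "10+"},
         {"age": "40+",   "msc": "no",  "exp": "10+"}]
labels = [0, 0, 1, 1]

# Occurrence frequency p_r(x) of every value of every feature.
freq = {f: {} for f in users[0]}
for u in users:
    for f, x in u.items():
        freq[f][x] = freq[f].get(x, 0) + 1 / len(users)

# Exhaustive search: evaluate every feature subset, keep the minimum-ratio one.
subsets = (s for k in range(1, len(users[0]) + 1)
           for s in itertools.combinations(users[0], k))
best = min(subsets, key=lambda s: intra_inter_ratio(users, labels, s, freq))
print(best)
```

On this toy input the full feature set wins, since every feature separates the two groups cleanly; on real survey data, skewed features would inflate the ratio and be dropped.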
4.2. Clustering with the modified k-Modes algorithm
Different clustering algorithms suit different types of data.
In our case, we implement a modified k-Modes method
(6, 47), taking into consideration that most of our data
contain categorical values.

FIGURE 4 | Selecting the set of features for the clustering algorithm.

The k-Modes algorithm consists of the following steps:
(1) First, the number of clusters $N_c$ to be created
is chosen, and one representative for each cluster
is randomly picked from the initial database
instances. The set of the representatives for all $N_c$
clusters will be referred to as $T_R \in \mathbb{R}^{N_c}$.

(2) In the second step, the similarity of each user to
every representative $R_j$ is calculated and inserted
into a matrix $S \in \mathbb{R}^{N_D \times N_c}$, where $N_D$ is the size of
our database (i.e., the number of social workers who
participated in the research).

(3) Then, each user instance $I_i$ is assigned to one of
the $N_c$ clusters by determining the maximum of
the similarity metrics in the row $s_i$ of $S$, where
$s_i \in \mathbb{R}^{N_c}$, $i = 1, 2, \ldots, N_D$. The cluster participation
vector $C \in \mathbb{R}^{N_D}$ (i.e., to which cluster every user
belongs) is then straightforwardly computed by
the following formula:

$$c_i = \operatorname{index}_j\left(\max_j (s_{ij})\right) \quad \text{for } j = 1, 2, \ldots, N_c \text{ and } \forall\, i = 1, 2, \ldots, N_D \quad (4)$$

(4) Having assigned each user $I_i$ to one of the $N_c$ clusters, the
new representatives $R_j$ must be calculated. The k-Modes
algorithm suggests that, for every cluster, a
“virtual” user is created who does not belong to the
original database. Therefore, for every feature $F_r$, we observe
the cluster members and calculate the
frequency $p_r(x_r^l)$ of each value $x_r^l$. The value that is
most common among the members is assigned to
the new representative $R_j$. This method is referred
to as taking the “mode” of the cluster members
(6). If more than one of the $x_r^l$ values is the
most frequent for a feature $F_r$, the mode randomly
selects one of them.

(5) Finally, we iterate over steps 2 to 4 until either
the set of the representatives ($T_R^{old} = T_R^{new}$) or
the cluster participation vector ($C^{old} = C^{new}$)
remains the same.
At this point, we should note that the k-Modes parameters
are tuned appropriately to achieve optimal accuracy and
efficiency in the final clusters. Generally speaking, the parameters
that need to be determined are (a) the number of clusters $N_c$,
(b) the similarity metric used to categorize the social workers,
and (c) the features used for the clustering process (also
referred to as feature selection). Considering our main
hypothesis that most of the social workers are young
graduates and/or of limited experience, the value $N_c = 2$
is chosen. To explain this further: if two distinct groups of
users are obtained, and the larger of them contains mostly
social workers who meet the criteria of our hypothesis, then
our hypothesis is supported.
Regarding the metric used to calculate the similarity
between users, consider two users
$I_i = (x_1^i, x_2^i, \ldots, x_{N_F}^i)$ and $I_j = (x_1^j, x_2^j, \ldots, x_{N_F}^j)$, where
$x_r^i$ and $x_r^j$ are the values of the $r$th feature for users $I_i$ and $I_j$,
respectively, while $N_F$ represents the total number of features
used to describe a user. The overall similarity between two
users is then defined as follows:

$$\text{Sim}(I_i, I_j) = \sum_{r=1}^{N_F} S(x_r^i, x_r^j) \quad (5)$$
where $S(x_r^i, x_r^j)$ denotes the similarity measure for the
individual attribute value $x_r$ of the two users. The most
straightforward technique is to set $S(x_r^i, x_r^j) = 1$ for every
feature value that $x_r^i$ and $x_r^j$ share in common. However, such
a metric does not reflect the overall data distribution,
meaning that the same similarity value is assigned whether
the shared value appears in most of the data or is
of rare frequency. To overcome this drawback, we use what
the literature refers to as the “inverse occurrence frequency”
metric (48). This metric calculates the similarity between
matching attributes of two users by leveraging the weight of
“rare” attribute values as follows:

$$S(x_r^i, x_r^j) = \begin{cases} \dfrac{1}{p_r(x_r^i)^2}, & \text{if } x_r^i = x_r^j \\[4pt] 0, & \text{otherwise} \end{cases} \quad (6)$$

where $p_r(x) = N_r(x) / N_D$ and $N_r(x)$ stands for the number
of users that possess the value $x$ for the $r$th feature. One
problem we encountered with this similarity metric was that
for two features, namely the attributes
“Current Organization Type” and “Past Organization Type”
of our dataset, most of the users had made multiple selections
in a single answer. A workaround for this problem was
to perform a modified one-hot encoding of these features.
One-hot encoding (49) is the procedure of producing binary
features labeled with all the possible values an initial feature
can take. However, this can lead to a common problem in
data science known as the “curse of dimensionality” (50). To
overcome this problem, we applied the following technique:
The term $X_r^i$ refers to the set of values that the user $I_i$
takes for the $r$th feature. Accordingly, the term $X_r^j$ refers to
the set of values for the $r$th feature of the user $I_j$. $X_r^{i \cap j}$ is the
set produced by the intersection of $X_r^i$ and $X_r^j$:

$$X_r^{i \cap j} = X_r^i \cap X_r^j \quad (7)$$

The similarity $S(X_r^i, X_r^j)$ of the users for the $r$th attribute
can be defined as the sum of the similarity scores for the
individual values that belong to the $X_r^{i \cap j}$ set. The above
concept is shown in Equation 8.
$$S(X_r^i, X_r^j) = \sum_{l=1}^{|X_r^{i \cap j}|} \frac{1}{p_r(x_l^{i \cap j})^2} \quad (8)$$

where $x_l^{i \cap j}$ is the $l$th value in the $X_r^{i \cap j}$ set and
$p_r(x) = N_r(x) / N_D$, where $N_r(x)$ stands for the number of users
that possess the value $x$ for the $r$th feature.
In general, this technique closely resembles one-hot
encoding, with the difference that the algorithm does not
have to deal with the “curse of dimensionality”: the creation
of new virtual features is avoided through the ad hoc
computation of the intersection set $X_r^{i \cap j}$ and the “inverse
occurrence frequency” measure for the set’s elements.
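A minimal sketch of Equation (8) on a single multi-valued feature follows; the organization types and their frequencies are hypothetical:

```python
def multi_value_sim(xi, xj, p):
    """Eq. (8): sum of inverse-squared frequencies over the values that the
    two users' answer sets share. xi, xj: sets; p: value -> frequency p_r(x)."""
    return sum(1.0 / p[x] ** 2 for x in xi & xj)

# Hypothetical organization-type answers and value frequencies; the rare
# shared value ("Ministry") dominates the similarity score.
p = {"NGO": 0.6, "Municipality": 0.3, "Ministry": 0.1}
s = multi_value_sim({"NGO", "Ministry"}, {"NGO", "Municipality", "Ministry"}, p)
print(round(s, 3))  # prints 102.778
```

Note that no binary indicator columns are created: the intersection is computed directly on the two answer sets, which is exactly how the dimensionality blow-up of one-hot encoding is avoided.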
The overall concept of the afore-mentioned technique is
presented as pseudo-code as follows:
begin:
    Initialize the set of representatives T_R with random users from the database.
    Initialize the similarity matrix S with zeros.
    while (T_R_old != T_R_new) do:
        for every user I_i:
            for every representative R_j:
                s_ij = 0
                for every feature F_r:
                    if F_r contains multiple values:
                        for every value x_r in F_r that I_i and R_j have in common:
                            s_ij += 1 / p_r(x_r)^2
                    else if the value x_r of F_r is the same for I_i and R_j:
                        s_ij += 1 / p_r(x_r)^2
            Determine the cluster of I_i by calculating c_i = index_j(max_j(s_ij))
            Append c_i to the cluster participation vector C
        if (C_new == C_old):
            break
        for every cluster C_j:
            Determine the new representative R_j and append it to T_R_new
    return the cluster participation vector C

Algorithm 1 | Modified k-Modes (Database D, Number of
Clusters Nc).
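Under simplifying assumptions (toy hypothetical data, and the mode of a multi-valued feature taken over whole answer sets rather than per individual value), Algorithm 1 can be sketched in Python as:

```python
import random
from collections import Counter

def modified_kmodes(users, n_clusters, multi_valued=frozenset(), seed=0):
    """Sketch of Algorithm 1. `users` is a list of dicts; features listed in
    `multi_valued` hold frozensets of values (multiple selections)."""
    rng = random.Random(seed)
    n = len(users)
    # Occurrence frequency p_r(x) of every value of every feature.
    p = {f: Counter() for f in users[0]}
    for u in users:
        for f, x in u.items():
            for v in (x if f in multi_valued else [x]):
                p[f][v] += 1 / n

    def sim(u, v):
        # Inverse-occurrence-frequency similarity, Eqs. (6) and (8).
        s = 0.0
        for f in u:
            if f in multi_valued:
                s += sum(1 / p[f][x] ** 2 for x in u[f] & v[f])
            elif u[f] == v[f]:
                s += 1 / p[f][u[f]] ** 2
        return s

    reps = rng.sample(users, n_clusters)      # random initial representatives
    labels = None
    while True:
        # Assign each user to the most similar representative, Eq. (4).
        new_labels = [max(range(n_clusters), key=lambda j: sim(u, reps[j]))
                      for u in users]
        if new_labels == labels:              # convergence: C_old == C_new
            return labels
        labels = new_labels
        # Mode step: most frequent value per feature among cluster members
        # (the old representative is kept if a cluster happens to be empty).
        for j in range(n_clusters):
            members = [u for u, l in zip(users, labels) if l == j] or reps[j:j + 1]
            reps[j] = {f: Counter(m[f] for m in members).most_common(1)[0][0]
                       for f in users[0]}

# Toy data (hypothetical answers); "org" allows multiple selections.
users = [{"age": "22-30", "org": frozenset({"NGO"})},
         {"age": "22-30", "org": frozenset({"NGO", "Ministry"})},
         {"age": "40+",   "org": frozenset({"Municipality"})},
         {"age": "40+",   "org": frozenset({"Municipality"})}]
print(modified_kmodes(users, 2, multi_valued={"org"}))
```

On this toy input the algorithm separates the two age groups into the two clusters regardless of which users are drawn as initial representatives.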
4.3. Evaluation of the clustering algorithm performance
Our dataset consists of 137 entries and 12 features. Two
of the features (“Current Organization Type” and “Past
Organization Type”) can contain multiple values. To evaluate
our results, we compare our clustering method with (a)
the normal k-Modes algorithm (one-hot encoding for
the features that contain multiple values) and (b) the
agglomerative clustering method (51). In Table 1, we present
the best three subsets of features for each technique as they
were generated by the feature selection phase, as well as
the intra/inter score for each subset. The overfitting caused
by one-hot encoding does not lead to high-quality clustering
results, as it can add too much “noise” to the features. As a
result, the modified k-Modes algorithm we propose achieves
lower intra/inter ratio values than the other clustering
techniques.
To further examine the behavior of the three algorithms,
we plotted the mean intra/inter ratio against the number
of features used (Figure 5). This allows us to observe what
to expect from these techniques with regard to how many
features we are willing to discard. The modified k-Modes
outperforms the other two clustering methods. In addition,
while the mean intra/inter ratio rises for hierarchical
clustering and normal k-Modes as the number of features
increases, this is not the case for the modified k-Modes: not
only does the mean ratio decrease, but it also does not change
drastically as the number of features increases. This analysis
indicates that the ideal number of features is 5, as this value
offers the advantage of fewer candidate features while
maintaining appropriate levels of clustering efficiency. This
choice is additionally supported by the fact that the optimal
subset produced through the feature selection method for the
modified k-Modes algorithm indeed had five features.
Taking the subset of features that yields the best evaluation
score for the modified k-Modes algorithm, TF = {Age,
hasMSc, Total Experience, Current Organization Type, Total
Time in the Current Organization}, leads to the creation
of two clusters, C1 and C2. The first cluster contains 86
members and the second contains 51 users. The scatterplot
of “age” versus “total experience” was designed in order
to compare the distributions of these two clusters. The
centroids (representatives) of each cluster are also presented
for reference in Figure 6.
Due to the categorical nature of these features, the
following problem arose: if two users had the same values
for the features age and total experience, their points would
completely overlap on the plot. We solved this by adding
Gaussian noise to every point on the plot so that the points
are slightly jittered around their initial spot.
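This jitter workaround can be sketched with NumPy; the encoded category indices below are hypothetical, and the noise scale 0.05 is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical encoded category indices: several respondents share the same
# (age, experience) pair, so identical points would overplot on a scatterplot.
x = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0])   # age bucket index
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 4.0])   # experience bucket index

# Small Gaussian jitter separates coincident markers without changing which
# category a point belongs to.
x_j = x + rng.normal(0.0, 0.05, size=x.size)
y_j = y + rng.normal(0.0, 0.05, size=y.size)
# A call such as plt.scatter(x_j, y_j) would then show one marker per respondent.
```

Because the noise standard deviation is far smaller than the spacing between category indices, each jittered point remains visually attached to its original category.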
It is obvious that the members of cluster C1 are, in their
majority, young social workers (22–30 years old) whose
working experience is very limited, as only 8 of them have
worked for more than 5 years. On the contrary, the users
categorized in C2 have more than 10 years of experience, and
their age distribution is mainly 35 years old and above.

TABLE 1 | Algorithms’ performance evaluation results.

Modified k-Modes
(1) Age, hasMSc, Total experience, Current organization type, and Total time in the current organization: ratio 0.3653
(2) Age, Total experience, Current organization type, Position at current organization, and Total time in the past organization: ratio 0.3719
(3) Age, Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization: ratio 0.3733

Agglomerative clustering
(1) Sex, Education, and hasMSc: ratio 0.5656
(2) Education, hasMSc, and Position at current organization: ratio 0.5837
(3) Education, hasMSc, and Location of residence: ratio 0.6156

Normal k-Modes
(1) Age, Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization: ratio 0.3733
(2) Age, Location of residence, Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization: ratio 0.3852
(3) Total experience, Total time in the current organization, Position at current organization, and Total time in the past organization: ratio 0.4027

FIGURE 5 | Mean ratio to number of features for all algorithms.

FIGURE 6 | Creating clusters with age/total experience scatterplot.

FIGURE 7 | Total experience distribution in clusters C1 and C2.

FIGURE 8 | Age distribution in clusters C1 and C2.
To further analyze our results, we created two graphs
that compare the “age” and “total experience” distributions
between the two clusters. As shown in Figure 7, in
cluster C1, 47 social workers have 2–3 years of experience,
while 16 of them have 0–1 year of experience. This indicates
that most of the social workers in this cluster have limited
working experience. Finally, observing the age distribution
for the same cluster in Figure 8, it is clear that these social
workers are also young graduates, as most of them are under
30 years of age.
5. Conclusion
The social sciences are crossing into the artificial intelligence
era, in which rapidly evolving sub-fields such as machine
learning offer enhanced data analysis tools as well as
improved algorithms and techniques. Statistical hypothesis
testing at a chosen significance level has been the
norm for a very long period, imposing a “statistical” point
of view on various problems and datasets.
Contrary to the above concept, we proposed an alternative
approach that arises from clustering algorithms and aspires
to become common ground for social science researchers
in the upcoming years. Our approach exploits a modified
k-Modes algorithm and bypasses statistical hypothesis testing
through cluster construction. Moreover, we addressed the
problem of selecting a subset of important features from the
whole dataset, so that the “important” features are known
before clustering is performed. Consequently, the clustering
process becomes more efficient and focused, as only the
important features are used. Our approach can therefore be
classified as a two-step method: we first rank the candidate
feature subsets and then select the best one for clustering.
Finally, as a future work direction, the outcomes obtained
through our approach can be further evaluated with a
number of alternative algorithms; researchers are free to
apply their own selection or amendment of techniques and
methods and to reapply them if necessary. Clustering allows
us to identify the sorting and allocation of observations,
offering possibilities for researchers to study: we start with
an initial number of clusters, allocate the observations to
the corresponding clusters, and subsequently evaluate how
representative each variable is in creating them. Therefore,
the result of one method can serve as input to the other,
making this a “cyclical approach” or, as we define it, “a
recursive feedback method”.
Conflict of interest
The authors declare that the research was conducted in the
absence of any commercial or financial relationships that
could be construed as a potential conflict of interest.
References
1. Teloni DD, Dedotsi S, Telonis AG. Refugee ‘crisis’ and social services in
Greece: social workers’ profile and working conditions. Eur J Soc Work.
(2020) 23:1005–18.
2. Jones C. Voices from the front line: state social workers and New Labour.
Br J Soc Work. (2001) 31:547–62.
3. Teloni DD, Dedotsi S, Lazanas A, Telonis A. Social work with refugees:
examining social workers’ role and practice in times of crisis in Greece.
Int Soc Work. (2021) 1:1–18.
4. Lancaster HO, Seneta E. Chi-square distribution. Encycl Biostat.
(2005) 2.
5. Adams WT, Skopek TR. Statistical test for the comparison of samples
from mutational spectra. J Mol Biol. (1987) 194:391–96.
6. Bai L, Liang J. The k-modes type clustering plus between-cluster
information for categorical data. Neurocomputing. (2014) 133:111–21.
7. Kurita T. An efficient agglomerative clustering algorithm using a heap.
Pattern Recogn. (1991) 24:205–09.
8. Chaturvedi A, Green PE, Caroll JD. K-modes clustering. J Class. (2001)
18:35–55.
9. Zambom AZ, Collazos JA, Dias R. Functional data clustering via
hypothesis testing k-means. Comput Stat. (2019) 34:527–49.
10. Ferraty F, Vieu P. Nonparametric functional data analysis: theory and
practice. Berlin: Springer Science+Business Media (2006).
11. Horváth L, Kokoszka P. Inference for functional data with applications,
Vol. 200. Berlin: Springer Science+Business Media (2012).
12. Hsing T, Eubank R. Theoretical foundations of functional data analysis,
with an introduction to linear operators, Vol. 997. Hoboken, NJ: John
Wiley & Sons (2015).
13. Kokoszka P, Reimherr M. Introduction to functional data analysis. Boca
Raton, FL: Chapman and Hall/CRC (2017).
14. Jacques J, Preda C. Functional data clustering: a survey. Adv Data Anal
Class. (2014) 8:231–55.
15. Wang JL, Chiou JM, Müller HG. Functional data analysis. Annu Rev Stat
Appl. (2016) 3:257–95.
16. Reimherr M, Nicolae D. A functional data analysis approach for genetic
association studies. Ann Appl Stat. (2014)8:406–29.
17. Young DL, Fields S. The role of functional data in interpreting the effects
of genetic variation. Mol Biol Cell. (2015) 26:3904–08.
18. Bowman FD, Guo Y, Derado G. Statistical approaches to
functional neuroimaging data. Neuroimaging Clin N Am. (2007)
17:441–58.
19. Hasenstab K, Scheffler A, Telesca D, Sugar CA, Jeste S, DiStefano C, et al.
A multi-dimensional functional principal components analysis of EEG
data. Biometrics. (2017) 73:999–1009.
20. Di Salvo F, Ruggieri M, Plaia A. Functional principal component analysis
for multivariate multidimensional environmental data. Environ Ecol
Stat. (2015) 22:739–57.
21. Saeys W, De Ketelaere B, Darius P. Potential applications of
functional data analysis in chemometrics. J Chemometr. (2008)
22:335–44.
22. Aguilera AM, Escabias M, Valderrama MJ, Aguilera-Morillo MC.
Functional analysis of chemometric data. Open J Stat. (2013)
3:334.
23. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and
display of genome-wide expression patterns. Proc Natl Acad Sci USA.
(1998) 95:14863–8.
24. Dahl DB, Newton MA. Multiple hypothesis testing by clustering
treatment effects. J Am Stat Assoc. (2007) 102:517–26.
25. Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in
microarray experiments. Stat Sci. (2003) 18:71–103.
26. Storey JD, Taylor JE, Siegmund D. Strong control, conservative point
estimation and simultaneous conservative consistency of false discovery
rates: a unified approach. J R Stat Soc Ser B. (2004) 66:187–205.
27. Benjamini Y, Yekutieli D. The control of the false discovery rate in
multiple testing under dependency. Ann Stat. (2001) 29:1165–88.
28. Ward JH Jr. Hierarchical grouping to optimize an objective function. J
Am Stat Assoc. (1963) 58:236–44.
29. Hartigan JA. Clustering algorithms. Hoboken, NJ: John Wiley & Sons,
Inc (1975).
30. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and
density estimation. J Am Stat Assoc. (2002) 97:611–31.
31. Medvedovic M, Sivaganesan S. Bayesian infinite mixture model based
clustering of gene expression profiles. Bioinformatics. (2002) 18:1194–
206.
32. Finch H. Comparison of distance measures in cluster analysis with
dichotomous data. J Data Sci. (2005) 3:85–100.
33. Tarpey T, Kinateder KK. Clustering functional data. J Class. (2003) 20.
34. Yamamoto M. Clustering of functional data in a low-dimensional
subspace. Adv Data Anal Class. (2012) 6:219–47.
35. Vrieze SI. Model selection and psychological theory: a discussion of
the differences between the Akaike information criterion (AIC) and
the Bayesian information criterion (BIC). Psychol Methods. (2012)
17:228.
36. Bouveyron C, Brunet-Saumard C. Model-based clustering of high-
dimensional data: a review. Computat Stat Data Anal. (2014)
71:52–78.
37. Sakamoto Y, Ishiguro M, Kitagawa G. Akaike information criterion
statistics. Dordrecht: Kluwer Academic Publishers Group (1986).
38. Bouveyron C, Jacques J. Model-based clustering of time series in group-
specific functional subspaces. Adv Data Anal Class. (2011) 5:281–300.
39. Chiou JM, Li PL. Functional clustering and identifying substructures of
longitudinal data. J R Stati Soc Ser B. (2007) 69:679–99.
40. Sugar CA, James GM. Finding the number of clusters in a dataset: an
information-theoretic approach. J Am Stat Assoc. (2003) 98:750–63.
41. Giacofci M, Lambert-Lacroix S, Marot G, Picard F. Wavelet-based
clustering for mixed-effects functional models in high dimension.
Biometrics. (2013) 69:31–40.
42. Ciollaro M, Genovese CR, Wang D. Nonparametric clustering
of functional data using pseudo-densities. Electron J Stat. (2016)
10:2922–72.
43. Bongiorno EG, Goia A. Classification methods for Hilbert data based on
surrogate density. Comput Stat Data Anal. (2016) 99:204–22.
44. Aggarwal CC. Data mining: the textbook. Berlin: Springer (2015).
45. Dietterich T. Overfitting and undercomputing in machine learning.
ACM Comput Surv. (1995) 27:326–7.
46. Ray S, Turi RH. “Determination of number of clusters in k-means
clustering and application in colour image segmentation,” in
Proceedings of the 4th international conference on advances in pattern
recognition and digital techniques. New Delhi: Narosa Publishing House
(1999), 137–43.
47. He Z, Xu X, Deng S. Attribute value weighting in k-modes clustering.
Expert Syst Appl. (2011) 38:15365–9.
48. Havrlant L, Kreinovich V. A simple probabilistic explanation of term
frequency-inverse document frequency (tf-idf) heuristic (and variations
motivated by this explanation). Int J Gen Syst. (2017) 46:27–36.
49. Dahouda MK, Joe I. A Deep-learned embedding technique for
categorical features encoding. IEEE Access. (2021) 9:114381–91.
50. Rust J. Using randomization to break the curse of dimensionality.
Econometrica. (1997) 65:487–516.
51. Müllner D. fastcluster: fast hierarchical, agglomerative clustering
routines for R and Python. J Stat Softw. (2013) 53:1–18.