The document discusses a study that evaluates the performance of the Naive Bayes and J48 classification algorithms on Swahili tweets. The study collected 276 tweets from popular Tanzanian Twitter accounts and preprocessed the data to remove non-Swahili words. The preprocessed data was then analyzed using the Naive Bayes and J48 algorithms in WEKA. The algorithms were evaluated based on accuracy, precision, recall, and ROC curve. The results showed that the Naive Bayes algorithm performed better than J48, achieving higher accuracy (36.96% vs 34.78%) and higher values for the other evaluation metrics as well. Therefore, the study found that Naive Bayes classification works better than J48 for classifying Swahili tweets.
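The study ran its classifiers inside WEKA; as an illustrative pure-Python sketch of what the multinomial Naive Bayes side of that comparison does (toy examples, not the study's tweets or preprocessing):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    # Multinomial Naive Bayes with Laplace (add-one) smoothing
    vocab = {w for d in docs for w in d.split()}
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)
    priors, counts, totals = {}, {}, {}
    for y, ds in by_class.items():
        priors[y] = math.log(len(ds) / len(docs))            # log P(class)
        counts[y] = Counter(w for d in ds for w in d.split())
        totals[y] = sum(counts[y].values())
    return vocab, priors, counts, totals

def predict_nb(model, doc):
    vocab, priors, counts, totals = model
    scores = {}
    for y in priors:
        s = priors[y]
        for w in doc.split():
            # log P(word | class), smoothed over the vocabulary size
            s += math.log((counts[y][w] + 1) / (totals[y] + len(vocab)))
        scores[y] = s
    return max(scores, key=scores.get)
```

WEKA's NaiveBayes and J48 implementations add considerably more machinery (attribute handling, pruning); this only shows the core probability model being compared.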
This document summarizes a research paper on analyzing sentiments from Twitter data using data mining techniques. The paper presents an approach for analyzing user sentiments using data mining classifiers and compares the performance of single classifiers versus an ensemble of classifiers for sentiment analysis. Experimental results show that the k-nearest neighbor classifier achieved very high predictive accuracy, and single classifiers outperformed the ensemble approach.
Political prediction analysis using text mining and deep learning (Vishwambhar Deshpande)
We have proposed a system to determine current sentiment on Twitter using the Twitter API for open access, which includes opinions from different content structures like latest news, audits, articles, and social media posts, and a deep learning method to study historic data for predicting future results. We utilized Naive Bayes and dictionary-based algorithms to predict the sentiment on live Twitter data.
IRJET - Fake News Detection using Logistic Regression (IRJET Journal)
1) The document discusses a study that uses logistic regression to classify news articles as real or fake. It outlines the methodology which includes data preprocessing, feature extraction using bag-of-words and TF-IDF, and using a logistic regression classifier to predict fake news.
2) The model achieved an accuracy of approximately 72% at classifying news as real or fake when using TF-IDF features and logistic regression.
3) The study aims to address the growing issue of fake news proliferation online by developing a computational method for identifying unreliable news sources.
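The TF-IDF feature-extraction step in the methodology above can be sketched in plain Python (a minimal illustration, not the paper's implementation; the resulting weight vectors would then be fed to a logistic regression classifier):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} vectors."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for d in docs:
        df.update(set(d))
    vectors = []
    for d in docs:
        tf = Counter(d)
        # weight = term frequency * inverse document frequency
        vectors.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return vectors
```

Note that a term appearing in every document gets weight zero, which is exactly why TF-IDF down-weights uninformative common words.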
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING (ijnlc)
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles through the use of semantic features and various machine learning techniques. In this research, we investigated recurrent neural networks versus the Naive Bayes classifier and random forest classifiers using five groups of linguistic features. Evaluated on the real-or-fake dataset from kaggle.com, the best performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams shows that word pairs, as opposed to single words or longer phrases, best indicate the authenticity of news.
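The bigram features that performed best can be illustrated with a minimal sketch (a generic illustration of the technique, not the authors' code; the resulting counts would feed a classifier such as a random forest):

```python
from collections import Counter

def bigram_features(text):
    # Lowercase, tokenize on whitespace, and count adjacent word pairs
    toks = text.lower().split()
    return Counter(zip(toks, toks[1:]))
```

For example, `bigram_features("breaking news tonight")` yields counts for the pairs `("breaking", "news")` and `("news", "tonight")`.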
This document presents research on analyzing sentiment and affect in dark web forums related to radical groups. It aims to determine how effective automated methods are at measuring opinion polarity and emotion intensity in these forums. The researchers collected data from two forums - a more radical Al-Firdaws forum and a more moderate Montada forum. They used machine learning techniques like SVR ensembles and feature selection to analyze 500 sentences from each forum for sentiment and intensities of emotions like violence and hate. The results found the Al-Firdaws forum expressed more negative sentiment and intense negative emotions, confirming domain expert assessments. A time series analysis also examined how forum affects changed over time.
- What are clustering, honeypots, and density-based clustering?
- What is OPTICS clustering and how does it differ from density-based (DB) clustering? …and how can it be used for outlier detection?
- What is so-called soft clustering and how does it differ from hard clustering? …and how can it be used for outlier detection?
The document presents a study that analyzes sentiment on Twitter using various classification algorithms. It compares the performance of Naive Bayes, Bayes Net, Discriminative Multinomial Naive Bayes, Sequential Minimal Optimization, Hyperpipes, and Random Forest algorithms on a Twitter sentiment dataset. The study finds that Discriminative Multinomial Naive Bayes and Sequential Minimal Optimization algorithms have the best performance with overall F-scores of 0.769 and 0.75, respectively. The study aims to determine the most accurate and efficient algorithms for Twitter sentiment classification.
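For reference, the per-class F-score behind the figures quoted above (0.769 and 0.75) is the harmonic mean of precision and recall; an overall F-score is typically a weighted average of these across classes:

```python
def f_score(precision, recall):
    # F1: harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)
```

The harmonic mean punishes imbalance: a classifier with precision 1.0 but recall 0.1 scores far below 0.55.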
Recently, fake news has been causing many problems for our society, and many researchers have been working on identifying it. Most fake news detection systems use linguistic features of the news; however, they have difficulty detecting highly ambiguous fake news, which can be identified only after understanding its meaning and the latest related information. In this paper, to resolve this problem, we present a new Korean fake news detection system using a fact DB that is built and updated through humans' direct judgment after collecting obvious facts. Our system receives a proposition and searches for semantically related articles in the fact DB in order to verify whether the given proposition is true by comparing it with those related articles. To achieve this, we utilize a deep learning model, Bidirectional Multi-Perspective Matching for Natural Language Sentences (BiMPM), which has demonstrated good performance on the sentence matching task. However, BiMPM has some limitations: the longer the input sentence, the lower its performance, and it has difficulty making an accurate judgment when an unlearned word or relation between words appears. To overcome these limitations, we propose a new matching technique that exploits article abstraction as well as an entity matching set in addition to BiMPM. In our experiments, we show that our system improves the overall performance of fake news detection.
Prasanth. K | Praveen. N | Vijay. S | Auxilia Osvin Nancy. V, "Fake News Detection using Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30014.pdf
Paper URL: https://www.ijtsrd.com/engineering/information-technology/30014/fake-news-detection-using-machine-learning/prasanth-k
A scalable, lexicon based technique for sentiment analysis (ijfcstjournal)
The rapid increase in the volume of sentiment-rich social media on the web has resulted in increased interest among researchers in sentiment analysis and opinion mining. However, with so much social media available on the web, sentiment analysis is now considered a big data task, and conventional sentiment analysis approaches fail to efficiently handle the vast amount of sentiment data available nowadays. The main focus of the research was to find a technique that can efficiently perform sentiment analysis on big data sets: one that can categorize text as positive, negative, or neutral in a fast and accurate manner. In the research, sentiment analysis was performed on a large data set of tweets using Hadoop, and the performance of the technique was measured in terms of speed and accuracy. The experimental results show that the technique is very efficient at handling big sentiment data sets.
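The per-tweet lexicon scoring at the heart of such a pipeline can be sketched as follows (toy lexicons for illustration; a real system would use a full sentiment lexicon, and the Hadoop deployment described above would run this scoring as the parallel map step over the tweet collection):

```python
# Toy lexicons; real lexicon-based systems use dictionaries of thousands of terms
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "hate", "terrible", "awful", "sad"}

def lexicon_sentiment(tweet):
    # Net count of positive vs negative lexicon hits decides the class
    toks = tweet.lower().split()
    score = sum(t in POSITIVE for t in toks) - sum(t in NEGATIVE for t in toks)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Because each tweet is scored independently, the method parallelizes trivially, which is what makes the lexicon approach attractive at big data scale.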
Towards Automatic Analysis of Online Discussions among Hong Kong Students (CITE)
HU, Xiao (University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_619.htm
Named Entity Recognition using Tweet Segmentation (IRJET Journal)
This document summarizes three research papers on named entity recognition (NER) in tweets. The first paper describes a system called TwiNER that performs NER on targeted Twitter streams to understand user opinions expressed in tweets about organizations. The second paper studies the challenges of NER in tweets due to their terse nature and presents a distantly supervised approach using Labeled LDA that improves NER performance. The third paper proposes modeling user interests in Twitter by extracting named entities from tweets using an unsupervised segmentation approach to avoid large annotation overhead.
Presents Natural Language Processing (NLP) algorithms for the Bay Area NLP reading group. Survey of probabilistic topic modeling such as Latent Dirichlet Allocation (LDA). Includes practical references explaining the algorithm along with software libraries for Python, Spark, and R.
This talk features the basics behind the science of Information Retrieval with a story-mode on information and its various aspects. It then takes you through a quick journey into the process behind building of the search engine.
This document summarizes research on detecting fake news using text analysis techniques. It discusses how social media consumption of news has increased and the challenges of identifying trustworthy sources. Various types of fake news are described based on visual/text content or the targeted audience. Methods for detection include clustering similar news reports and using predictive models to analyze linguistic features like punctuation, semantic levels, and readability. The proposed approach uses text summarization, web crawling to find related articles, latent semantic analysis to compare articles, and fuzzy logic to determine the authenticity score of a target news article. The goal is to develop a system to help users identify fake news on social media platforms.
Ontology Tutorial: Semantic Technology for Intelligence, Defense and Security (Barry Smith)
Dr. Barry Smith is the director of the National Center for Ontological Research. He discussed how semantic technology can help solve the problem of data silos by enabling data from different sources to be integrated and analyzed together. Ontologies, or controlled vocabularies, can be used to semantically enhance data by tagging it in an interoperable way. This allows the data to be retrieved, understood, and used by others even if they were not involved in creating the data. The semantic enhancement approach aims to break down silos incrementally by coordinating the creation of ontologies and linking datasets through shared terms.
Cyworld Jeju 2009 Conference (10 Aug 2009) No. 2(2) (SangMe Nam)
This document summarizes a research paper that analyzed comments on South Korean politicians' profiles on the social networking site Cyworld. The researchers collected over 200 random comments and categorized them as positive, negative, or irrelevant. They then used machine learning algorithms like naive Bayes and support vector machines to automatically classify new comments. The algorithms achieved accuracies around 70%, outperforming a model that labeled all comments as the most common class. The tools developed in this research could help analyze political communication on social networks and public sentiment toward politicians.
This document summarizes research on building a serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo Answers. It describes extracting entities and relationships between them to build entity networks. It then details using a random walk retrieval algorithm and rank aggregation to perform searches across the networks. The researchers analyze the system's precision, MAP, and ability to provide unexpected yet relevant results. User studies found the combined system provided more relevant, interesting, and informative results compared to using Wikipedia or Yahoo Answers individually. Metadata like sentiment, readability and categories was added to entity networks to help promote serendipity.
An Abridged Version of My Statement of Research Interests (adil raja)
The document summarizes the research interests of Muhammad Adil Raja in machine learning theory and applications. Specifically, his interests involve (1) developing a thorough understanding of machine learning methods, algorithms, and subdomains and (2) effectively applying machine learning to solve real-world problems. As part of his PhD research, he developed methods for real-time, non-intrusive speech quality estimation for VoIP calls using genetic programming. His research resulted in several publications and honors and he continues researching applications of machine learning.
Kuan-ming Lin is interested in data mining, particularly mining biological databases, web documents, and the semantic web. He has skills in data mining techniques including machine learning, feature selection, and support vector machines. He has published papers on data integration of microarray data and structure prediction of HIV coreceptors. He hopes to continue a career in data mining and cloud computing.
Language Models for Information Retrieval (Dustin Smith)
The document provides background information on Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze, who are authors of the book "Introduction to Information Retrieval: Language models for information retrieval". It then outlines the presentation which discusses language models for information retrieval, including query likelihood models, estimating query generation probabilities, and experiments comparing language modeling approaches to other IR techniques.
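The query likelihood approach outlined in that presentation scores each document by the probability that its language model generates the query. A minimal sketch with Dirichlet smoothing (the `mu` value and the assumption that query terms occur somewhere in the collection are illustrative, not from the book):

```python
import math
from collections import Counter

def query_likelihood(query, doc, collection, mu=2000.0):
    # log P(query | doc) under a Dirichlet-smoothed unigram language model:
    #   P(w | d) = (tf(w, d) + mu * P(w | C)) / (|d| + mu)
    # Assumes every query term occurs at least once in the collection text.
    d = Counter(doc.split())
    dlen = sum(d.values())
    c = Counter(collection.split())
    clen = sum(c.values())
    score = 0.0
    for w in query.split():
        p_coll = c[w] / clen          # collection (background) language model
        score += math.log((d[w] + mu * p_coll) / (dlen + mu))
    return score
```

Smoothing with the collection model keeps documents that miss a query term from scoring log-zero, one of the estimation issues the presentation's "estimating query generation probabilities" section covers.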
AI Based Search Engine for Locating Desirable Open Educational Resources (Ishan Abeywardena, Ph.D.)
Location of Desirable Open Educational Resources in the Commonwealth Connect Portal Directory of Open Educational Resources (DOER) using the OERScout Artificial Intelligence Based Search Engine.
A Survey on Decision Support Systems in Social Media (Editor IJCATR)
Web 3.0 is the upcoming phase in web evolution. Web 3.0 will be about “feeding you the information that you want, when you want it”, i.e. personalization of the web. In Web 3.0 the basic principle is linking, integrating, and analyzing data from various data sources into new information streams by means of semantic technology. So we can say that Web 3.0 comprises two platforms: semantic technologies and the social computing environment. A recommender system is a subclass of decision support system, and recommendations in the social web are used to personalize the web [20]. A Social Tagging System is one type of social media. In this paper we present a survey of various recommendations in Social Tagging Systems (STSs), such as tag, item, user, and unified recommendations, along with the semantic web, and also discuss major thrust areas of research in each category.
A Novel approach for Document Clustering using Concept Extraction (AM Publications)
In this paper we present a novel approach to extract the concept from a document and cluster a set of documents depending on the concept extracted from each of them. We transform the corpus into vector space using term frequency–inverse document frequency (TF-IDF), then calculate the cosine distance between each pair of documents, followed by clustering them using the K-means algorithm. We also use multidimensional scaling to reduce the dimensionality within the corpus. This results in the grouping of documents that are most similar to each other with respect to their content and genre.
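The cosine-distance step between TF-IDF vectors can be sketched as follows (a minimal illustration assuming sparse dict vectors, not the paper's implementation; these pairwise distances would then drive K-means and the multidimensional scaling):

```python
import math

def cosine_distance(a, b):
    # a, b: sparse vectors as {term: tf-idf weight} dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return 1.0 - dot / (norm_a * norm_b)
```

Distance 0 means the documents use terms in identical proportions; distance 1 means they share no weighted terms at all.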
WARRANTS GENERATIONS USING A LANGUAGE MODEL AND A MULTI-AGENT SYSTEM (ijnlc)
Each argument begins with a conclusion, which is followed by one or more premises supporting it. The warrant is a critical component of Toulmin's argument model; it explains why the premises support the claim. Despite its critical role in establishing the claim's veracity, it is frequently omitted or left implicit, leaving readers to infer it. We consider the problem of producing more diverse and high-quality warrants in response to a claim and evidence. First, we employ BART [1] as a conditional sequence-to-sequence language model to guide the output generation process, fine-tuning the BART model on the ARCT dataset [2]. Second, we propose the Multi-Agent Network for Warrant Generation, a model for producing more diverse and high-quality warrants by combining Reinforcement Learning (RL) and Generative Adversarial Networks (GANs) with a mechanism of mutual awareness among agents. In terms of warrant generation, our model generates a greater variety of warrants than other baseline models, and the experimental results validate the effectiveness of our proposed hybrid model.
Contextualized versus Structural Overlapping Communities in Social Media (Mohsen Shahriari)
This slide deck summarizes research on detecting overlapping community structures in social media networks. It discusses challenges in detecting contextualized overlapping communities based on post content. The key research questions are how structural community detection is affected by adding contextual similarities between users, and whether combining content and structure can improve detection performance. Several baseline structural detection algorithms are described, including DMID, SLPA, SSK, and CLIZZ. The presentation then proposes two content-based detection methods, CFOCA and TCMA, and combining content and structural values.
An Introduction to Information Retrieval and Applications (sathish sak)
The score you get depends on the functions, difficulty, and quality of your project.
For system development:
- System functions and correctness
For academic paper presentation:
- Quality and your presentation of the paper
- Major methods/experimental results *must* be presented
- Papers from top conferences are strongly suggested, e.g. SIGIR, WWW, CIKM, WSDM, JCDL, ICMR, …
Proposals are *required* for each team, and will be counted in the score.
Profile-based Dataset Recommendation for RDF Data Linking (Mohamed Ben Ellefi)
This document summarizes Mohamed Ben Ellefi's PhD thesis defense on profile-based dataset recommendation for RDF data linking. The thesis proposes two approaches: a topic profile-based approach and an intensional profile-based approach. The topic profile-based approach models datasets as topics and recommends target datasets based on similarity between source and target topic profiles, achieving an average recall of 81% and reducing the search space by 86%. The approach shows better performance than baselines but needs improvement on precision.
Weka is a popular suite of machine learning software written in Java. It was developed at the University of Waikato in New Zealand and is free to use under the GNU license. Weka allows users to preprocess and analyze data, build models, and perform predictive analytics. It includes an easy to use graphical user interface that allows users to load datasets, run machine learning algorithms, and evaluate experimental results.
Empirical Study on Classification Algorithm For Evaluation of Students Academ... (iosrjce)
Data mining techniques (DMT) are extensively used in the educational field to find new hidden patterns in students' data. In recent years, the greatest issues educational institutions are facing are the unstable expansion of educational data and how to utilize this data to improve the quality of managerial decisions. Educational institutions play a prominent role in the public sphere and are essential for the enlargement and progress of a nation. The idea is to predict the paths of students, thus identifying student achievement. Data mining methods are very useful for prediction over educational databases; educational data mining is concerned with developing techniques for discovering knowledge from data that comes from the educational domain. However, there are issues with the accuracy of classification algorithms; to address this, the J48 classification algorithm, which offers higher accuracy, is used. This work takes into consideration the locality and the educational performance of the student in order to analyse whether student achievement is higher in schooling or in graduation.
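J48 is WEKA's implementation of C4.5, which grows a decision tree by repeatedly choosing the attribute split with the highest information gain (C4.5 actually normalizes this to gain ratio). The core computation can be sketched as follows (attribute and label names are illustrative, not the study's data):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # rows: list of {attribute: value} dicts; gain of splitting on a
    # categorical attribute = parent entropy - weighted child entropy
    base = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr], []).append(y)
    return base - sum(len(g) / len(labels) * entropy(g) for g in groups.values())
```

An attribute that perfectly separates the classes recovers the full parent entropy as gain, while an attribute with a single value for every row yields zero gain and would never be chosen as a split.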
Abstract - Various aspects of three proposed architectures for distributed software are examined. There is a permanent and ongoing need to create an ideal model for an optimal architecture that meets the organization's needs for flexibility, extensibility, and integration, and that delivers exhaustive performance for potential processes and opportunities in corporations. The excellence of the proposed architecture is demonstrated by presenting a rigorous scenario-based proof of the adaptivity and compatibility of the architecture in cases of merging and varying organizations, where the whole structure of hierarchies is revised.
Keywords: ERP, data-centric architecture, component-based architecture, plug-in architecture, distributed systems
A scalable, lexicon based technique for sentiment analysisijfcstjournal
Rapid increase in the volume of sentiment rich social media on the web has resulted in an increased
interest among researchers regarding Sentimental Analysis and opinion mining. However, with so much
social media available on the web, sentiment analysis is now considered as a big data task. Hence the
conventional sentiment analysis approaches fails to efficiently handle the vast amount of sentiment data
available now a days. The main focus of the research was to find such a technique that can efficiently
perform sentiment analysis on big data sets. A technique that can categorize the text as positive, negative
and neutral in a fast and accurate manner. In the research, sentiment analysis was performed on a large
data set of tweets using Hadoop and the performance of the technique was measured in form of speed and
accuracy. The experimental results shows that the technique exhibits very good efficiency in handling big
sentiment data sets.
Towards Automatic Analysis of Online Discussions among Hong Kong StudentsCITE
HU, Xiao (University of Hong Kong)
http://citers2013.cite.hku.hk/en/paper_619.htm
---------------------------
Author(s) bear(s) the responsibility in case of any infringement of the Intellectual Property Rights of third parties.
---------------------------
CITE was notified by the author(s) that if the presentation slides contain any personal particulars, records and personal data (as defined in the Personal Data (Privacy) Ordinance) such as names, email addresses, photos of students, etc, the author(s) have/has obtained the corresponding person's consent.
Named Entity Recognition using Tweet SegmentationIRJET Journal
This document summarizes three research papers on named entity recognition (NER) in tweets. The first paper describes a system called TwiNER that performs NER on targeted Twitter streams to understand user opinions expressed in tweets about organizations. The second paper studies the challenges of NER in tweets due to their terse nature and presents a distantly supervised approach using Labeled LDA that improves NER performance. The third paper proposes modeling user interests in Twitter by extracting named entities from tweets using an unsupervised segmentation approach to avoid large annotation overhead.
Presents Natural Language Processing (NLP) algorithms for for Bay Area NLP reading group. Survey of Probabilistic Topic Modeling such as Latent Dirichlet Allocation (LDA). Includes practical references explaining the algorithm along with software libraries for Python, Spark, and R.
This talk features the basics behind the science of Information Retrieval with a story-mode on information and its various aspects. It then takes you through a quick journey into the process behind building of the search engine.
This document summarizes research on detecting fake news using text analysis techniques. It discusses how social media consumption of news has increased and the challenges of identifying trustworthy sources. Various types of fake news are described based on visual/text content or the targeted audience. Methods for detection include clustering similar news reports and using predictive models to analyze linguistic features like punctuation, semantic levels, and readability. The proposed approach uses text summarization, web crawling to find related articles, latent semantic analysis to compare articles, and fuzzy logic to determine the authenticity score of a target news article. The goal is to develop a system to help users identify fake news on social media platforms.
Ontology Tutorial: Semantic Technology for Intelligence, Defense and SecurityBarry Smith
Dr. Barry Smith is the director of the National Center for Ontological Research. He discussed how semantic technology can help solve the problem of data silos by enabling data from different sources to be integrated and analyzed together. Ontologies, or controlled vocabularies, can be used to semantically enhance data by tagging it in an interoperable way. This allows the data to be retrieved, understood, and used by others even if they were not involved in creating the data. The semantic enhancement approach aims to break down silos incrementally by coordinating the creation of ontologies and linking datasets through shared terms.
Cyworld Jeju 2009 Conference(10 Aug2009)No2(2)SangMe Nam
This document summarizes a research paper that analyzed comments on South Korean politicians' profiles on the social networking site Cyworld. The researchers collected over 200 random comments and categorized them as positive, negative, or irrelevant. They then used machine learning algorithms like naive Bayes and support vector machines to automatically classify new comments. The algorithms achieved accuracies around 70%, outperforming a model that labeled all comments as the most common class. The tools developed in this research could help analyze political communication on social networks and public sentiment toward politicians.
This document summarizes research on building a serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo Answers. It describes extracting entities and relationships between them to build entity networks. It then details using a random walk retrieval algorithm and rank aggregation to perform searches across the networks. The researchers analyze the system's precision, MAP, and ability to provide unexpected yet relevant results. User studies found the combined system provided more relevant, interesting, and informative results compared to using Wikipedia or Yahoo Answers individually. Metadata like sentiment, readability and categories was added to entity networks to help promote serendipity.
An Abridged Version of My Statement of Research Interests - adil raja
The document summarizes the research interests of Muhammad Adil Raja in machine learning theory and applications. Specifically, his interests involve (1) developing a thorough understanding of machine learning methods, algorithms, and subdomains and (2) effectively applying machine learning to solve real-world problems. As part of his PhD research, he developed methods for real-time, non-intrusive speech quality estimation for VoIP calls using genetic programming. His research resulted in several publications and honors and he continues researching applications of machine learning.
Kuan-ming Lin is interested in data mining, particularly mining biological databases, web documents, and the semantic web. He has skills in data mining techniques including machine learning, feature selection, and support vector machines. He has published papers on data integration of microarray data and structure prediction of HIV coreceptors. He hopes to continue a career in data mining and cloud computing.
Language Models for Information Retrieval - Dustin Smith
The document provides background information on Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze, who are authors of the book "Introduction to Information Retrieval: Language models for information retrieval". It then outlines the presentation which discusses language models for information retrieval, including query likelihood models, estimating query generation probabilities, and experiments comparing language modeling approaches to other IR techniques.
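The query-likelihood model discussed in that chapter ranks documents by the probability that each document's language model would generate the query. A minimal sketch, not code from the book: a unigram model with Jelinek-Mercer smoothing, where the documents, the query, and the lambda value are all illustrative.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score P(q|d) under a unigram model with Jelinek-Mercer smoothing:
    P(w|d) = lam * P_ml(w|d) + (1 - lam) * P_ml(w|C)."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 1.0
    for w in query:
        p_doc = doc_tf[w] / len(doc)          # maximum-likelihood estimate from the document
        p_coll = coll_tf[w] / len(collection)  # background estimate from the whole collection
        score *= lam * p_doc + (1 - lam) * p_coll
    return score

# Toy collection of two tokenized "documents"
docs = [
    "click go the shears boys click click click".split(),
    "metal shears are sold here".split(),
]
collection = [w for d in docs for w in d]
query = "click shears".split()
scores = [query_likelihood(query, d, collection) for d in docs]
best = max(range(len(docs)), key=lambda i: scores[i])  # index of the top-ranked document
```

The smoothing term keeps a document from scoring zero just because one query word is absent, which is the main practical point of the approach.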
AI Based Search Engine for Locating Desirable Open Educational Resources - Ishan Abeywardena, Ph.D.
Location of Desirable Open Educational Resources in the Commonwealth Connect Portal Directory of Open Educational Resources (DOER) using the OERScout Artificial Intelligence Based Search Engine.
A Survey on Decision Support Systems in Social Media - Editor IJCATR
Web 3.0 is the upcoming phase in web evolution. Web 3.0 will be about “feeding you the information that you want, when you want it”, i.e. personalization of the web. The basic principle of Web 3.0 is linking, integrating, and analyzing data from various data sources into new information streams by means of semantic technology, so we can say that Web 3.0 comprises two platforms: semantic technologies and the social computing environment. A recommender system is a subclass of decision support system, and recommendations on the social web are used to personalize it [20]. A social tagging system is one type of social media. In this paper we present a survey of the various recommendations in Social Tagging Systems (STSs) - tag, item, user, and unified recommendations - along with the semantic web, and also discuss the major thrust areas of research in each category.
A Novel approach for Document Clustering using Concept Extraction - AM Publications
In this paper we present a novel approach to extract the concept from a document and cluster such set of documents depending on the concept extracted from each of them. We transform the corpus into vector space by using term frequency–inverse document frequency then calculate the cosine distance between each document, followed by clustering them using K means algorithm. We also use multidimensional scaling to reduce the dimensionality within the corpus. It results in the grouping of documents which are most similar to each other with respect to their content and the genre.
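The pipeline described (TF-IDF weighting, then cosine comparison between documents) can be sketched in plain Python; the K-means and multidimensional-scaling steps are omitted for brevity, and the toy documents are illustrative rather than taken from the paper's corpus.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf-idf} vector."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "data mining cluster analysis".split(),
    "cluster analysis of documents".split(),
    "cooking pasta recipes".split(),
]
v = tfidf_vectors(docs)
sim_01 = cosine(v[0], v[1])  # overlapping vocabulary -> positive similarity
sim_02 = cosine(v[0], v[2])  # disjoint vocabulary -> zero similarity
```

In the full approach, the resulting pairwise cosine distances would feed K-means to group documents by extracted concept.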
WARRANTS GENERATIONS USING A LANGUAGE MODEL AND A MULTI-AGENT SYSTEM - ijnlc
Each argument begins with a conclusion, which is followed by one or more premises supporting the
conclusion. The warrant is a critical component of Toulmin's argument model; it explains why the premises
support the claim. Despite its critical role in establishing the claim's veracity, it is frequently omitted or left
implicit, leaving readers to infer. We consider the problem of producing more diverse and high-quality
warrants in response to a claim and evidence. To begin, we employ BART [1] as a conditional sequence-to-sequence language model to guide the output generation process. On the ARCT dataset [2], we fine-tune
the BART model. Second, we propose the Multi-Agent Network for Warrant Generation as a model for
producing more diverse and high-quality warrants by combining Reinforcement Learning (RL) and
Generative Adversarial Networks (GAN) with the mechanism of mutual awareness of agents. In terms of
warrant generation, our model generates a greater variety of warrants than other baseline models. The
experimental results validate the effectiveness of our proposed hybrid model for generating warrants.
Contextualized versus Structural Overlapping Communities in Social Media. Mohsen Shahriari
This slide deck summarizes research on detecting overlapping community structures in social media networks. It discusses challenges in detecting contextualized overlapping communities based on post content. The key research questions are how structural community detection is affected by adding contextual similarities between users, and whether combining content and structure can improve detection performance. Several baseline structural detection algorithms are described, including DMID, SLPA, SSK, and CLIZZ. The presentation then proposes two content-based detection methods, CFOCA and TCMA, and combining content and structural values.
An Introduction to Information Retrieval and Applications - sathish sak
This slide deck introduces information retrieval and its applications, and outlines how course projects are graded: the score depends on the functions, difficulty, and quality of the project.
- For system development: system functions and correctness.
- For academic paper presentations: quality of the paper and of the presentation; major methods and experimental results *must* be presented. Papers from top conferences are strongly suggested, e.g. SIGIR, WWW, CIKM, WSDM, JCDL, ICMR, …
- Proposals are *required* for each team and count toward the score.
Profile-based Dataset Recommendation for RDF Data Linking - Mohamed BEN ELLEFI
This document summarizes Mohamed Ben Ellefi's PhD thesis defense on profile-based dataset recommendation for RDF data linking. The thesis proposes two approaches: a topic profile-based approach and an intensional profile-based approach. The topic profile-based approach models datasets as topics and recommends target datasets based on similarity between source and target topic profiles, achieving an average recall of 81% and reducing the search space by 86%. The approach shows better performance than baselines but needs improvement on precision.
Weka is a popular suite of machine learning software written in Java. It was developed at the University of Waikato in New Zealand and is free to use under the GNU license. Weka allows users to preprocess and analyze data, build models, and perform predictive analytics. It includes an easy to use graphical user interface that allows users to load datasets, run machine learning algorithms, and evaluate experimental results.
Empirical Study on Classification Algorithm For Evaluation of Students Academ... - iosrjce
Data mining techniques (DMT) are extensively used in the educational field to find new hidden patterns in student data. In recent years, the greatest issues educational institutions face are the unstable expansion of educational data and how to utilize that data to improve the quality of managerial decisions. Educational institutions play a prominent role in the public sphere and are essential to the enlargement and progress of a nation. The idea is to predict the paths of students and thereby identify student achievement. Data mining methods are very useful for making predictions over educational databases, and educational data mining is concerned with improving techniques for discovering knowledge in data that comes from educational databases. However, the accuracy of classification algorithms remains an issue; to overcome this, the higher-accuracy J48 classification algorithm is used. This work takes into consideration the locality and the educational performance of students in order to analyse whether student achievement is higher in schooling or in graduation.
Abstract - Various aspects of three proposed architectures for distributed software are examined. There is a permanent and ongoing need to create an ideal model for an optimal architecture that meets an organization's needs for flexibility, extensibility, and integration, and that delivers exhaustive performance for potential processes and opportunities in corporations. The excellence of the proposed architecture is demonstrated through a rigorous scenario-based proof of its adaptivity and compatibility in cases of merging and changing organizations, where the whole structure of hierarchies is revised.
Keywords: ERP, data-centric architecture, component-based architecture, plug-in architecture, distributed systems
The document provides details on developing a new mobile application for Metrolink transit. It discusses weaknesses in the current application and outlines features and justifications for the new application being developed. The key points are:
1) The new application aims to remove unnecessary features from the current Metrolink app and add useful new functionality like ticket purchasing that users want.
2) Features like real-time updates, multiple ticket purchasing, QR codes for tickets, and alarms for arrival times are described and their benefits outlined.
3) Making the app free, reducing time spent on tasks, and providing relevant location-based information are emphasized as ways to enhance usability and encourage more users.
This document discusses the data cleaning process for a dataset combining Reddit news headlines and Dow Jones Industrial Average (DJIA) values. The dataset contained issues like special characters in headlines and multiple word forms referring to the same concept. To address this, the data was cleaned by removing special characters, stemming words to their root forms, and removing stopwords. Keywords were identified by creating a corpus from the cleaned headlines and calculating term frequencies. Two frequency ranges containing about 50 terms each were selected for further analysis against the DJIA rise/fall classification attribute. The goal of the data cleaning was to extract keyword counts that could help predict the DJIA classification.
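A minimal stand-in for the cleaning steps described (special-character removal, stopword removal, stemming, then term-frequency counting) might look like the following. The stopword list and the suffix-stripping "stemmer" are crude illustrative substitutes for real resources such as a full stopword corpus and a Porter stemmer; the headlines are made up.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "to", "and", "in", "on", "as"}  # tiny illustrative list

def clean(headline):
    """Lowercase, strip special characters, drop stopwords, crude suffix stem."""
    tokens = re.sub(r"[^a-z\s]", " ", headline.lower()).split()
    stemmed = []
    for t in tokens:
        if t in STOPWORDS:
            continue
        for suf in ("ing", "ed", "s"):  # naive stand-in for a real stemmer
            if t.endswith(suf) and len(t) > len(suf) + 2:
                t = t[: -len(suf)]
                break
        stemmed.append(t)
    return stemmed

headlines = [
    "Stocks surge as markets rally!",
    "Market rallies on surging stocks...",
]
# Term frequencies over the cleaned corpus, as a basis for keyword selection
terms = Counter(t for h in headlines for t in clean(h))
```

Frequency ranges over a counter like `terms` are what the study used to pick the ~50-term keyword sets analysed against the DJIA rise/fall attribute.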
This document describes a data mining project to analyze bank marketing data using classification algorithms. The goal is to predict whether a client will subscribe to a term deposit based on their attributes. The dataset contains information on over 61,000 customers. Pre-processing steps like data cleaning, handling missing values, removing outliers, and scaling are performed. Algorithms like Naive Bayes, J48 decision trees, and PART are applied and their results compared to determine the best predictor of subscriptions.
Classification and Clustering Analysis using Weka - Ishan Awadhesh
This Term Paper demonstrates the classification and clustering analysis on Bank Data using Weka. Classification Analysis is used to determine whether a particular customer would purchase a Personal Equity PLan or not while Clustering Analysis is used to analyze the behavior of various customer segments.
The document is a group report for redesigning the Metrolink mobile app. It includes:
- An executive summary outlining flaws in the current app and the goals of redesigning it.
- Details of the methodology used including gathering user requirements, collaborating on ideas, creating navigation diagrams and prototypes.
- Usability evaluations were conducted on the prototypes including SUS surveys, Nielsen's heuristics analysis and heuristic statements.
- The final redesigned app is presented through storyboards and a high-level prototype, addressing user needs through new features like smart journeys, location-based notifications and a rewards system.
This document discusses developing classifiers for a census income dataset using the WEKA data mining tool. It summarizes preprocessing steps like handling missing values, removing outliers, and balancing class distributions. It then evaluates various classifiers like J48 decision trees and k-nearest neighbors (IBk) on the preprocessed data. The best performing model was an ensemble "vote" classifier that combined the predictions of J48, IBk, and logistic regression models, achieving 87.3% accuracy and an ROC area of 0.947.
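The "vote" combination can be sketched with a simple majority rule over the member models' predictions. The per-model predictions below are made up for illustration, not the study's outputs, and ties are broken in favour of the first model.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model prediction lists instance-by-instance;
    ties are broken by the first model's vote."""
    combined = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = counts.most_common(1)[0][1]
        # keep the first model's choice among tied winners
        winner = next(v for v in votes if counts[v] == top)
        combined.append(winner)
    return combined

# Hypothetical per-instance predictions from three base classifiers
j48   = [">50K", "<=50K", ">50K",  "<=50K"]
ibk   = [">50K", ">50K",  "<=50K", "<=50K"]
logit = [">50K", "<=50K", "<=50K", ">50K"]
ensemble = majority_vote([j48, ibk, logit])
```

Weka's Vote meta-classifier offers several combination rules (majority vote, average of probabilities, etc.); this sketch shows only the majority rule.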
This document discusses a project analyzing a mushroom dataset containing descriptions of 23 species of gilled mushrooms to determine if they are edible or poisonous. The dataset includes 8,124 instances with 22 attributes describing intrinsic mushroom characteristics and external factors. A decision tree model achieved 100% accuracy in classification by considering attributes like odor, habitat, population density, and cap/stalk features. Further analysis found odor and spore print color were the most statistically significant but not practical for identification in the field. Future work aims to develop easier visual classification methods using attributes like habitat, population density, and cap/stalk characteristics.
Dialectal Arabic sentiment analysis based on tree-based pipeline optimizatio... - IJECEIAES
This document summarizes a research paper that proposes using a tree-based pipeline optimization tool (TPOT) to improve sentiment classification of dialectal Arabic texts. The paper provides background on sentiment analysis and challenges in analyzing informal Arabic texts. It then discusses related work applying TPOT and AutoML techniques to optimize machine learning for various tasks. The proposed approach uses TPOT for sentiment analysis of three Arabic dialect datasets to automatically optimize hyperparameters and improve over similar prior work.
A Framework for Arabic Concept-Level Sentiment Analysis using SenticNet - IJECEIAES
This document presents a framework for Arabic concept-level sentiment analysis using SenticNet. It discusses existing sentiment analysis approaches and focuses on concept-level sentiment analysis, which classifies text based on semantics rather than syntax. The authors modify SenticNet to suit Arabic and test it on a multi-domain Arabic dataset. Syntactic patterns are used to extract concepts from sentences, which are then translated to English and matched to SenticNet concepts to determine polarity. An accuracy of 70% was obtained when testing the generated lexicon on the dataset. The lexicon containing 69k unique concepts covers reviews from multiple domains and is made publicly available.
IRJET- Real Time Sentiment Analysis of Political Twitter Data using Machi... - IRJET Journal
This document summarizes a research paper that analyzed sentiments of political tweets related to the Ayodhya issue in India using machine learning. It collected tweets using keywords and preprocessed them by removing URLs, usernames, stop words, and irrelevant data. It then extracted sentiment-bearing words as features. It classified the polarity of each tweet as positive, negative, or neutral using the Vader sentiment analysis tool and calculated overall sentiment scores. It aimed to analyze public opinion on the Ayodhya issue expressed on Twitter.
Automatic Movie Rating By Using Twitter Sentiment Analysis And Monitoring Tool - Laurie Smith
This document summarizes a research article that developed an automatic movie rating system using Twitter sentiment analysis and monitoring tool. The researchers collected Turkish tweets with hashtags of movie titles and classified them as positive, negative or neutral sentiments using machine learning models. They trained models like Naive Bayes, Support Vector Machines, Random Forest and achieved the best accuracy of 87% using a voting algorithm and best AUC of 0.96 using SVM. They developed a web application using Flask to monitor sentiment scores in real-time via hashtags. The study aimed to help users make choices by evaluating emotions expressed about movies on social media.
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF... - IJCSEA Journal
In this digital era, social media is an important tool for information dissemination. Twitter is a popular social media platform. Social media analytics helps make informed decisions based on people's needs and opinions. This information, when properly perceived provides valuable insights into different domains, such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm to discover a hidden pattern in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to corona virus related discussions. This study compares regular LDA and LDA based on collapsed Gibbs sampling (LDAMallet) algorithms. The experiments use different data processing steps including trigrams, without trigrams, hashtags, and without hashtags. This study provides a comprehensive analysis of LDA for short text messages using un-pooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve the topic inference results with a better coherence score.
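The hashtag-pooling scheme the study evaluates can be sketched as follows: tweets sharing a hashtag are concatenated into one pseudo-document before topic modeling, which gives LDA longer texts to infer topics from. The tweets and the `_unpooled` fallback key are illustrative.

```python
import re
from collections import defaultdict

def pool_by_hashtag(tweets):
    """Concatenate tweets sharing a hashtag into one pseudo-document."""
    pools = defaultdict(list)
    for t in tweets:
        tags = re.findall(r"#(\w+)", t.lower())
        # tweets with no hashtag fall back to a catch-all pseudo-document
        for tag in tags or ["_unpooled"]:
            pools[tag].append(t)
    return {tag: " ".join(ts) for tag, ts in pools.items()}

tweets = [
    "Vaccine rollout starts today #covid19",
    "Stay safe everyone #covid19 #health",
    "New cafe opened downtown",
]
docs = pool_by_hashtag(tweets)  # pseudo-documents keyed by hashtag
```

The resulting pseudo-documents, rather than individual tweets, would then be fed to an LDA implementation; the study reports that this pooling improves coherence scores.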
International Journal of Computer Science, Engineering and Applications (IJCSEA) - IJCSEA Journal
International Journal of Computer Science, Engineering and Applications (IJCSEA) is an open-access, peer-reviewed journal that publishes articles contributing new results in all areas of computer science, engineering, and applications. The journal is devoted to the publication of high-quality papers on theoretical and practical aspects of these fields.
This document summarizes a research paper that proposes using a logistic regression classifier trained with stochastic gradient descent to predict Twitter users' personalities from their tweets. It begins with an abstract of the paper and an introduction to personality prediction from social media. It then provides more detail on the anatomy of the research, including defining personality prediction from Twitter, its applications, and the general process of using machine learning for the task. Next, it reviews several previous studies on personality prediction from Twitter and social networks, noting their approaches, findings, and limitations. It identifies remaining research gaps, such as the need for improved linguistic analysis of tweets and more robust, scalable predictive models. Finally, it proposes using a logistic regression classifier as the personality prediction model to address these gaps.
A HYBRID CLASSIFICATION ALGORITHM TO CLASSIFY ENGINEERING STUDENTS’ PROBLEMS ... - IJDKP
Social networking sites have brought a new horizon for expressing the views and opinions of individuals. Moreover, they provide a medium for students to share their sentiments, including struggles and joys, during the learning process. Such informal information provides a valuable venue for decision making. The large and growing scale of this information requires automatic classification techniques, and sentiment analysis is one such automated technique for classifying large data. Existing predictive sentiment analysis techniques are widely used to classify reviews on e-commerce sites to provide business intelligence. However, they are not very useful for drawing decisions in an education system, since they classify sentiments into merely three pre-set categories: positive, negative, and neutral. Moreover, classifying students' sentiments into a positive or negative category does not provide deeper insight into their problems and perks. In this paper, we propose a novel Hybrid Classification Algorithm to classify engineering students' sentiments. Unlike traditional predictive sentiment analysis techniques, the proposed algorithm makes the sentiment analysis process descriptive. Moreover, it classifies engineering students' perks, in addition to problems, into several categories to help future students and the education system in decision making.
Sentiment Analysis and Classification of Tweets using Data Mining - IRJET Journal
This document summarizes research on using data mining techniques to perform sentiment analysis on tweets. The researchers collected tweets from Twitter and preprocessed the text to make it usable for building sentiment classifiers. They used three classifiers - K-Nearest Neighbor, Naive Bayes, and Decision Tree - and compared the results to determine which provided the best accuracy. Rapid Miner tool was used to preprocess the text, build the classifiers, and analyze the results. The goal was to determine people's sentiments expressed in their tweets and correctly classify them.
IRJET v4 i7: A Survey on Student’s Academic Experiences using Social Media Data - IRJET Journal
This document summarizes a research paper that analyzed social media posts by engineering students to understand their academic experiences. It developed a workflow that integrated qualitative analysis and data mining techniques. By applying classification algorithms to tweets with hashtags, it identified three main problems engineering students discussed: heavy study loads, lack of social engagement, and sleep deprivation. The analysis revealed students struggled with managing heavy course loads, which led to other issues affecting their health, motivation, and emotions. It also found posts discussing diversity issues and cultural stereotypes among engineering students.
The document proposes a multi-tier sentiment analysis system called MSABDP to analyze large-scale social media data more efficiently. MSABDP uses Hadoop for its distributed processing and storage capabilities. It collects Twitter data using Apache Flume and stores it in HDFS. It then applies a multi-tier classification approach combining lexicon-based and machine learning techniques to classify tweets into multiple sentiment classes, reducing complexity compared to single-tier architectures. Evaluation on real Twitter data showed MSABDP improved classification accuracy over single-tier approaches by 7%.
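The lexicon-based tier of such a system can be sketched in a few lines. The lexicon, the thresholding rule, and the example tweets below are illustrative; a real system would use a full resource such as SentiWordNet and hand tweets the lexicon cannot decide over to the machine-learning tier.

```python
# Tiny illustrative lexicon; real systems use resources such as SentiWordNet.
LEXICON = {"good": 1, "great": 2, "love": 2, "bad": -1, "awful": -2, "hate": -2}

def lexicon_score(tweet):
    """First-tier score: sum word polarities; the sign gives the coarse class."""
    score = sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in tweet.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # ambiguous tweets would go to the ML tier

labels = [lexicon_score(t) for t in
          ["I love this phone, great battery!",
           "Awful service, I hate waiting.",
           "Just landed in Dodoma."]]
```

Splitting the work this way is what lets a multi-tier design cut complexity: the cheap lexicon pass handles clear-cut tweets, and only the remainder needs a trained classifier.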
Online social network analysis with machine learning techniques - Hari KC
This document summarizes a student's final project presentation on analyzing social networks using machine learning techniques. The objectives of the project were to determine features and sentiment from tweets using natural language processing, evaluate machine learning classifiers' accuracy, and compare the popularity of two smartphones. The methodology involved extracting tweets from Twitter using its API, selecting features for analysis, training and testing classifiers, and calculating sentiment indexes and relative strengths to determine popularity. Results showed classifiers' accuracies, important features, sentiment analysis plots comparing the two smartphones, and that Android had a higher sentiment index and relative strength than iOS based on Twitter data analysis.
Sentiment Analysis of Twitter tweets using supervised classification technique - IJERA Editor
Making use of social media to analyze the perceptions of the masses about a product, event, or person has gained momentum in recent times. Out of a wide array of social networks, we chose Twitter for our analysis, as the opinions expressed there are concise and bear a distinctive polarity. Here, we collect the most recent tweets on a user's area of interest and analyze them. The extracted tweets are then segregated as positive, negative, or neutral. We do the classification in the following manner: we collect the tweets using the Twitter API; we then process the collected tweets by converting all letters to lowercase, eliminating special characters, etc., which makes the classification more efficient; finally, the processed tweets are classified using a supervised classification technique. We make use of a Naive Bayes classifier to segregate the tweets as positive, negative, or neutral, using a set of sample tweets to train the classifier. The percentage of tweets in each category is then computed and the result is represented graphically. The result can be used further to gain insight into the views of Twitter users about a particular topic searched by the user. It can help corporate houses devise strategies on the basis of the popularity of their products among the masses, and it may help consumers make informed choices based on the general sentiment expressed by Twitter users about a product.
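A multinomial Naive Bayes classifier of the kind described can be written from scratch in a short sketch. This version uses add-one (Laplace) smoothing and whitespace tokenization; the training tweets and labels are made up for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over tweet tokens."""

    def fit(self, tweets, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)                 # class frequencies
        self.word_counts = defaultdict(Counter)       # per-class token counts
        for tweet, label in zip(tweets, labels):
            self.word_counts[label].update(tweet.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tweet):
        def log_post(c):
            counts = self.word_counts[c]
            total = sum(counts.values())
            lp = math.log(self.priors[c] / sum(self.priors.values()))
            for w in tweet.lower().split():
                # add-one smoothing keeps unseen words from zeroing the product
                lp += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return lp
        return max(self.classes, key=log_post)

nb = NaiveBayes().fit(
    ["great phone love it", "awful battery hate it",
     "love the screen", "bad awful camera"],
    ["positive", "negative", "positive", "negative"])
label = nb.predict("love this great screen")
```

Working in log space avoids floating-point underflow when tweets are longer than a few tokens, which is why the per-word probabilities are summed as logs rather than multiplied.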
This document provides a review of techniques, tools, and platforms for analyzing social media data. It discusses the types of social media data and formats available, as well as tools for accessing, cleaning, analyzing, and visualizing social media data. Some key challenges of social media research are the restricted access to comprehensive data sources, lack of tools for in-depth analysis without programming, and need for large data storage and computing facilities to support research at scale. The document provides a methodology and critique of current approaches and outlines requirements to better support social media research.
An Analytical Survey on Hate Speech Recognition through NLP and Deep Learning - IRJET Journal
The document discusses hate speech recognition through natural language processing and deep learning. It provides an overview of past work on hate speech detection using various techniques like pattern-based methods, multi-class text classification, hybrid neural networks, convolutional neural networks and feature-based machine learning models. The literature review analyzes these previous studies and their methods and limitations to help develop an improved methodology for hate speech recognition.
The document describes the process for conducting a content analysis and developing taxonomies from academic articles. It involves identifying topics and keywords, searching relevant databases and journals with limits, downloading abstracts, analyzing content, identifying themes, and consolidating the analysis into a taxonomy or classification system. An example taxonomy is provided that categorizes 295 analyzed articles based on various themes, such as unit of analysis, area of study, theories used, methodology, and origin of study. The purpose is to systematically classify and organize knowledge within a research domain.
This document reviews research on predicting personality from Twitter users' tweets using machine learning algorithms. It discusses how tweets have attracted research interest from diverse fields. Different techniques have been used to predict personality from tweets, but there are still shortcomings to address. The aim is to consider the current state of this research area and explore personality prediction from tweets by reviewing past literature and discussing approaches to issues researchers face. It provides an overview of machine learning methods used for personality prediction from tweets, including data collection, preprocessing, model training and evaluation.
Similar to Naïve Bayes and J48 Classification Algorithms on Swahili Tweets: Performance Evaluation (20)
Andreas Schleicher presents PISA 2022 Volume III - Creative Thinking - 18 Jun...EduSkills OECD
Andreas Schleicher, Director of Education and Skills at the OECD presents at the launch of PISA 2022 Volume III - Creative Minds, Creative Schools on 18 June 2024.
What is Digital Literacy? A guest blog from Andy McLaughlin, University of Ab...
Naïve Bayes and J48 Classification Algorithms on Swahili Tweets: Performance Evaluation
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 14, No. 1, January 2016
Naïve Bayes and J48 Classification Algorithms on
Swahili Tweets: Performance Evaluation
Hassan Seif
College of Informatics and Virtual Education
University of Dodoma
Dodoma, Tanzania
Abstract—The use of social media has grown significantly due to the evolution of Web 2.0 technologies. People can share ideas, comment, and post about events. Twitter is one such social media site; it consists of very short messages created by registered users and has played an important part in many events through the messages its users post. This study aims to evaluate the performance of the Naïve Bayes and J48 classification algorithms on Swahili tweets. Swahili is among the African languages that are growing fastest and is receiving wide attention in web usage through social networks, blogs, portals, etc. To the best of the researcher's knowledge, many studies have compared classification algorithms on other languages, but no similar study has been found for the Swahili language. The data for this study were collected from the ten most popular Twitter accounts in Tanzania using NodeXL; these accounts were identified according to their number of followers. The extracted data were pre-processed to remove noise, incomplete data, outliers, inconsistent data, symbols, etc. Further, tweets containing words not in the Swahili language were identified and removed, and the tweets were filtered by removing URL links and Twitter user names. The pre-processed data were analysed in WEKA using the Naïve Bayes and J48 classification algorithms. The algorithms were then evaluated on their accuracy, precision, recall and Receiver Operating Characteristic (ROC). It was found that the Naïve Bayes classification algorithm performs better on Swahili tweets than the J48 classification algorithm.
Keywords-Social media; Swahili tweets; Naïve Bayes; J48
I. INTRODUCTION
Due to the evolution of Web 2.0 technologies, the use of social media sites has nowadays grown significantly. People communicate and post their comments and views through social media sites depending on their interests and opinions. It is estimated that there are over 900 social media sites on the internet, with the more popular platforms including Facebook, Twitter, LinkedIn, Google Plus, and YouTube [1].
Twitter is a popular and massive social networking site that hosts a large number of very short messages created by registered users. It is estimated that more than 140 million active users publish over 400 million 140-character "Tweets" every day [2]. The speed and ease of publication on Twitter have made it an important communication medium. Twitter has played a prominent role in socio-political events and has also been used to post damage reports and disaster-preparedness information during large natural disasters, such as Hurricane Sandy [2].
The data posted on Twitter can be used for various research purposes. In the context of data mining, two fundamental tasks can be considered in conjunction with Twitter data: (a) graph mining, based on analysis of the links amongst messages, and (b) text mining, based on analysis of the messages' actual text [3]. Twitter graph mining can be applied to measuring user influence and the dynamics of popularity, community discovery and formation, and social information diffusion. In Twitter text mining, the tasks that can be performed include sentiment analysis, classification of tweets into categories, clustering of tweets, and trending-topic detection [3]. This study is based on classification of tweets into categories, whereby the algorithms used were compared.
Swahili is among the African languages that are growing fastest and is receiving wide attention in web usage through social networks, blogs, portals, etc. It is spoken in several African countries, such as Tanzania, Kenya, Uganda, Burundi, DR Congo, Rwanda, Mozambique and Somalia, and has about 50 million speakers. There are four categories of African languages, namely Khoisan, Afro-Asiatic, Nilo-Saharan and Niger–Congo Kordofanian. Swahili belongs to the Niger–Congo group of languages, specifically the Sabaki subgroup of Northeastern Coast Bantu languages [4].
To the best of the researcher's knowledge, several studies have been conducted comparing classification algorithms, but many of them are based on English and other languages; there are no similar studies on the Swahili language. Furthermore, there is no ready-made corpus of Swahili tweets publicly available for research purposes. For this reason, Swahili can be regarded as an under-resourced language. The term "under-resourced language" refers to a language with some (if not all) of the following aspects: lack of a unique writing system or stable orthography, limited presence on the web, lack of linguistic expertise, and lack of electronic resources for speech and language processing, such as monolingual corpora, bilingual electronic dictionaries, transcribed speech data, pronunciation dictionaries, vocabulary lists, etc. [5].
https://sites.google.com/site/ijcsis/
ISSN 1947-5500
This study compares the performance of the Naïve Bayes and J48 classification algorithms on Swahili tweets.
II. NAÏVE BAYES
Naïve Bayes is a simple classifier based on Bayes' theorem. It is a statistical classifier that performs probabilistic prediction. The classifier works by assuming that the attributes are conditionally independent given the class.
For Naïve Bayes classification, the following equation is used [6]:

P(Ci | X) = P(X | Ci) P(Ci) / P(X)    (1)
From equation (1), the classifier, or simple Bayesian classifier, works as follows:
(1) Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, A1, A2, ..., An, respectively.
(2) Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the Naïve Bayesian classifier predicts that tuple X belongs to class Ci if and only if P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize P(Ci | X); the class Ci for which P(Ci | X) is maximized is called the maximum posteriori hypothesis.
(3) From equation (1), since P(X) is constant for all classes, only P(X | Ci)P(Ci) needs to be maximized: the classifier predicts that X belongs to the class Ci for which this product is highest.
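The decision rule above can be illustrated with a minimal Python sketch. The study's experiments use WEKA's Java implementation; the toy tuples and attribute values below are made up purely for illustration:

```python
from collections import Counter, defaultdict

# Toy training set D: each entry is (attribute vector X, class label).
# The values are illustrative and not taken from the paper's dataset.
train = [
    (("habari", "short"), "news"),
    (("habari", "long"), "news"),
    (("michezo", "short"), "sports"),
    (("michezo", "long"), "sports"),
    (("habari", "short"), "news"),
]

# Estimate the prior P(Ci) and the conditionals P(xk | Ci) from counts.
prior = Counter(label for _, label in train)
cond = defaultdict(Counter)  # cond[(k, label)][value] = count
for x, label in train:
    for k, value in enumerate(x):
        cond[(k, label)][value] += 1

def predict(x):
    """Return the class Ci maximizing P(X | Ci) P(Ci), assuming the
    attributes are conditionally independent given the class (eq. 1)."""
    best, best_score = None, -1.0
    n = len(train)
    for label, c in prior.items():
        score = c / n  # P(Ci)
        for k, value in enumerate(x):
            # Laplace smoothing keeps an unseen attribute value from
            # zeroing out the whole product (a common safeguard).
            seen = cond[(k, label)]
            score *= (seen[value] + 1) / (c + len(seen) + 1)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict(("habari", "long")))  # -> news
```

Because P(X) is the same for every class, the sketch never computes it, exactly as step (3) above describes.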
III. J48
J48 is a decision-tree induction algorithm. It is an open-source Java implementation of the C4.5 algorithm in the WEKA data mining tool [7]. The C4.5 algorithm, developed by Ross Quinlan, creates a decision tree that can be used for classification based on the values presented in the dataset. The following steps are used when the decision tree is constructed by the J48 classification algorithm:
(1) In general, the tree is constructed in a top-down, recursive, divide-and-conquer manner. At the start, all the training examples are at the root; attributes are categorical (if continuous-valued, they are discretized in advance); examples are partitioned recursively based on selected attributes; and test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
(2) Partitioning stops when: all samples for a given node belong to the same class; there are no remaining attributes for further partitioning (majority voting is then employed to classify the leaf); or there are no samples left.
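The attribute-selection heuristic named in step (1), information gain, can be sketched in a few lines of Python. This is an illustration on toy data only; the full C4.5/J48 algorithm additionally uses gain ratio and pruning:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute index attr,
    the measure a C4.5-style tree builder uses to pick the test attribute."""
    total = entropy(labels)
    n = len(rows)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        total -= (len(subset) / n) * entropy(subset)
    return total

# Toy dataset with two categorical attributes (values are illustrative).
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = ["yes", "yes", "no", "no"]

# Attribute 0 separates the classes perfectly; attribute 1 not at all.
print(information_gain(rows, labels, 0))  # -> 1.0
print(information_gain(rows, labels, 1))  # -> 0.0
```

The tree builder would place attribute 0 at the root here, since it yields the larger gain.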
IV. RELATED WORKS
A review of the literature reveals a number of studies conducted to compare several classification algorithms.
Goyal and Mehta [8] conducted a comparative evaluation of the Naïve Bayes and J48 classification algorithms. The study was in the context of a financial-institute dataset, with the aim of checking the accuracy and cost of these algorithms by maximizing the true positive rate and minimizing the false positive rate of defaulters, using the WEKA tool. The results showed that the efficiency and accuracy of both J48 and Naïve Bayes are good [8].
Another study, conducted by Arora and Suman [9], made a comparative analysis of classification algorithms on different datasets using WEKA. The comparison covered two algorithms, J48 and the Multilayer Perceptron (MLP), whose performance was analysed in order to choose the better algorithm for the conditions of the datasets. J48 is based on C4.5 decision-tree learning, while MLP uses the multilayer feed-forward neural network approach for classifying datasets. It was found that MLP performs better than the J48 algorithm.
Patil and Sherekar [10] compared the performance of the J48 and Naïve Bayes classification algorithms on a bank dataset, aiming to maximize the true positive rate and minimize the false positive rate of defaulters rather than achieve only higher classification accuracy, using the WEKA tool. The study found that the efficiency and accuracy of J48 are better than those of Naïve Bayes.
Furthermore, a comparative analysis of classification algorithms for students' college-enrollment approval using data mining was conducted on a dataset from the King Abdulaziz University database. In that study, the WEKA knowledge-analysis tool was used to simulate practical measurements, and the classification technique with the potential to significantly improve performance was suggested for use in college admission and enrollment applications. It was found that the C4.5, PART and Random Forest algorithms give the highest performance and accuracy with the lowest errors, while the IBK-E and IBK-M algorithms give high errors and low accuracy [11].
V. METHODOLOGY
A. Data Set Collection
The dataset for this study was collected from the ten most popular Twitter accounts in Tanzania using the NodeXL software. These accounts were identified according to their number of followers, as presented on the Socialbakers site [12]. Hot topics, together with their comments, were identified and extracted using NodeXL, and the collected data were stored in CSV format so that they could easily be analysed in the WEKA software.
B. Data Preprocessing
This is one of the most important steps in data mining: without quality data, there can be no quality mining results. Data preprocessing techniques include data cleaning, data integration, data transformation, data reduction and data discretization [6]; these techniques may be combined as stages of data preprocessing.
The data for this study were cleaned in order to remove noisy data, incomplete data, outliers, inconsistent data, symbols, etc. Tweets containing words not in the Swahili language were also identified, and the tweets were further filtered by removing URL links and Twitter user names. Finally, the words and tweets that were not in Swahili were removed, because some tweets were found to be in mixed language (Swahili and English) and others in English only.
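The URL and user-name filtering described above can be approximated with simple regular expressions. This is an illustrative sketch, not the exact cleaning pipeline used in the study (language identification for dropping non-Swahili words is a separate step, not shown):

```python
import re

def clean_tweet(tweet: str) -> str:
    """Strip URL links, Twitter user names and stray symbols from a tweet."""
    tweet = re.sub(r"https?://\S+", "", tweet)   # remove url links
    tweet = re.sub(r"@\w+", "", tweet)           # remove twitter user names
    tweet = re.sub(r"[^\w\s]", "", tweet)        # drop symbols/punctuation
    return re.sub(r"\s+", " ", tweet).strip()    # collapse whitespace

# Hypothetical tweet, for illustration only:
print(clean_tweet("Habari za leo @rais_tz? Soma hapa: https://t.co/abc123"))
# -> "Habari za leo Soma hapa"
```

The cleaned strings can then be exported to CSV for loading into WEKA.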
C. Data Analysis
The pre-processed data were analysed using the WEKA software. WEKA stands for the Waikato Environment for Knowledge Analysis, developed by the University of Waikato in New Zealand. It is open-source software issued under the GNU General Public License. WEKA contains a collection of machine learning algorithms for data mining tasks, with techniques for data pre-processing, classification, regression, clustering, association rules, visualization, etc. It is written in Java and runs on almost every platform, and it is also well suited for developing new machine learning schemes. The tool gathers a comprehensive set of data pre-processing tools, learning algorithms, evaluation methods, graphical user interfaces (including data visualization) and an environment for comparing learning algorithms. WEKA is easy to use and can be applied at several different levels.
WEKA was selected because the Naïve Bayes and J48 classification algorithms are implemented in this tool, which supports the objective of the study: to compare the performance of the Naïve Bayes and J48 classification algorithms on Swahili tweets.
D. Model Evaluation
After analysing the data in WEKA, the algorithms were compared on their performance. Performance evaluation was based on recall, precision, accuracy and the ROC curve. The formulas used to evaluate the algorithms are based on the confusion matrix described in Table 1.
Table 1: Confusion Matrix

                    Detected Positive     Detected Negative
Actual Positive     A: True Positive      B: False Negative
Actual Negative     C: False Positive     D: True Negative
Recall (also called sensitivity or true positive rate) is the proportion of positive cases that were correctly identified. Recall can be calculated using the following equation:

Recall = A / (A + B)    (2)
Precision (also called confidence) denotes the proportion of predicted positive cases that are truly positive. Equation (3) can be used to find precision:

Precision = A / (A + C)    (3)
The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier. The counts of true positives, true negatives, false positives and false negatives are also useful in assessing the costs and benefits (or risks and gains) associated with a classification model [6]. The following equation can be used to calculate the accuracy of the classifier:

Accuracy = (A + D) / (A + B + C + D)    (4)
The Receiver Operating Characteristic (ROC) curve is a graphical method for displaying the trade-off between the true positive rate and the false positive rate of a classifier. The true positive rate is plotted along the Y-axis and the false positive rate along the X-axis. The quality of prediction/classification can be read from the area under the ROC curve (A):

A = 1.0: perfect prediction
A = 0.9: excellent prediction
A = 0.8: good prediction
A = 0.7: mediocre prediction
A = 0.6: poor prediction
A = 0.5: random prediction
A < 0.5: something is wrong
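Equations (2) through (4) can be computed directly from the four confusion-matrix cells of Table 1. The counts below are illustrative and are not taken from the experiment:

```python
def metrics(A, B, C, D):
    """Recall, precision and accuracy from confusion-matrix cells:
    A = true positives, B = false negatives,
    C = false positives, D = true negatives (Table 1)."""
    recall = A / (A + B)                    # equation (2)
    precision = A / (A + C)                 # equation (3)
    accuracy = (A + D) / (A + B + C + D)    # equation (4)
    return recall, precision, accuracy

# Illustrative counts only:
print(metrics(A=40, B=10, C=20, D=30))  # -> (0.8, 0.666..., 0.7)
```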
VI. RESULT AND DISCUSSION
Experiments were performed on the Swahili tweets dataset, which was extracted using NodeXL. The dataset comprised a total of 276 tweets with 5 attributes, and these data were analysed in the WEKA tool using the Naïve Bayes and J48 classification algorithms. Before the dataset was tested on the classification algorithms, attribute-subset selection was applied in order to select the best attributes and remove weak, irrelevant attributes, since high-dimensional data make the testing and training of general classification methods difficult [13]. The heuristic method used was stepwise forward selection (Best First), whereby the best of the original attributes is determined and added to the reduced attribute set.
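Stepwise forward selection can be sketched as a greedy loop: starting from an empty attribute set, repeatedly add whichever attribute most improves an evaluation score, and stop when no addition helps. The score function below is a made-up stand-in for the merit measure a tool such as WEKA would compute:

```python
def forward_select(attrs, score):
    """Greedy stepwise forward selection over a list of attribute names.

    score(selected) evaluates a candidate attribute subset; higher is better.
    """
    selected, best = [], float("-inf")
    remaining = list(attrs)
    while remaining:
        # Try adding each remaining attribute and keep the best candidate.
        cand, cand_score = max(
            ((a, score(selected + [a])) for a in remaining),
            key=lambda t: t[1])
        if cand_score <= best:
            break  # no attribute improves the score: stop
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected

# Toy score: only attributes "a" and "c" help (illustrative values).
useful = {"a": 0.3, "b": 0.0, "c": 0.2, "d": 0.0}
print(forward_select(["a", "b", "c", "d"],
                     lambda s: sum(useful[x] for x in s)))  # -> ['a', 'c']
```

With only 5 attributes in the study's dataset, such a search is cheap; its value is in discarding the weak attributes before training.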
The accuracy of the selected algorithms (J48 and Naïve Bayes) was tested by cross-validation. Specifically, 10-fold cross-validation was used: the dataset was randomly partitioned into 10 mutually exclusive folds of approximately equal size, and training and testing were performed 10 times. In each iteration, one fold was selected as the test set and the remaining nine folds were used as the training set. The final accuracy of an algorithm is the average over the 10 trials.
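The fold construction described above can be sketched as follows (illustrative Python; the study performed this step inside WEKA):

```python
import random

def ten_fold_indices(n, seed=0):
    """Randomly partition indices 0..n-1 into 10 roughly equal,
    mutually exclusive folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(276)  # the study's dataset size
for i, test_fold in enumerate(folds):
    # Train on the other nine folds, test on test_fold, record accuracy;
    # the final figure is the average of the 10 recorded accuracies.
    train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
    assert len(train_idx) + len(test_fold) == 276

print(len(folds), sum(len(f) for f in folds))  # -> 10 276
```

Each tweet appears in exactly one test fold, so every instance is used for testing exactly once across the 10 trials.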
Table 2 shows the results of the experiment on Swahili tweets using the Naïve Bayes and J48 classification algorithms in WEKA.

Table 2: Experiment Results

Algorithm      Accuracy   Precision   Recall   ROC
Naïve Bayes    36.96%     0.294       0.37     0.525
J48            34.78%     0.121       0.348    0.461
It was found that the Naïve Bayes classification algorithm performs better on Swahili tweets than the J48 classification algorithm. The models were evaluated using accuracy, precision, recall and ROC, and Naïve Bayes achieved higher accuracy (36.96%) than J48 (34.78%). This implies that the total number of instances correctly classified by Naïve Bayes is larger than the total number correctly classified by J48. Furthermore, Naïve Bayes was found to be better in terms of precision (0.294), recall (0.37) and ROC (0.525) than J48 (precision 0.121, recall 0.348, ROC 0.461).
VII. CONCLUSION
Naïve Bayes was found to be the better classification algorithm on the Swahili tweets dataset, compared to the J48 classification algorithm, in terms of accuracy, precision, recall and ROC. In general, however, the performance of both the Naïve Bayes and J48 algorithms on Swahili tweets was very poor, as the values of their evaluation measures (accuracy, precision, recall and ROC) are very small.

More research should be conducted to identify the algorithm that gives the highest performance in terms of accuracy, precision, recall, ROC and other evaluation measures. Further research should also be conducted on how to increase the performance of both algorithms (Naïve Bayes and J48) on these measures.
REFERENCES
[1] R. C. M. Jr and F. M, "Social Media Analytics: Data Mining Applied to Insurance Twitter Posts," Casualty Actuar. Soc. E-Forum, vol. 2, 2012.
[2] S. Kumar, F. Morstatter, and H. Liu, Twitter Data Analytics. Springer, 2013.
[3] A. Bifet and E. Frank, "Sentiment knowledge discovery in Twitter streaming data," in Discovery Science, 2010, pp. 1–15.
[4] S. Marjie-Okyere, "Borrowings in Texts: A Case of Tanzanian Newspapers," New Media Mass Commun., vol. 16, pp. 1–9, 2013.
[5] L. Besacier, E. Barnard, A. Karpov, and T. Schultz, "Automatic Speech Recognition for Under-Resourced Languages: A Survey."
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2006.
[7] G. Kaur and A. Chhabra, "Improved J48 Classification Algorithm for the Prediction of Diabetes," Int. J. Comput. Appl., vol. 98, no. 22, pp. 13–17, 2014.
[8] A. Goyal and R. Mehta, "Performance Comparison of Naïve Bayes and J48 Classification Algorithms," Int. J. Appl. Eng. Res., vol. 7, no. 11, 2012.
[9] R. Arora and Suman, "Comparative Analysis of Classification Algorithms on Different Datasets using WEKA," Int. J. Comput. Appl., vol. 54, no. 13, pp. 21–25, 2012.
[10] T. R. Patil and S. S. Sherekar, "Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification," Int. J. Comput. Sci. Appl., vol. 6, no. 2, pp. 256–261, 2013.
[11] A. H. M. Ragab, A. Y. Noaman, A. S. AL-Ghamd, and A. I. Madbouly, "A Comparative Analysis of Classification Algorithms for Students College Enrollment Approval Using Data Mining," in Workshop on Interaction Design in Educational Environments, 2014, p. 106.
[12] "Most popular Twitter accounts in Tanzania - Socialbakers." [Online]. Available: http://www.socialbakers.com/statistics/twitter/profiles/tanzania/. [Accessed: 13-Dec-2015].
[13] A. G. Karegowda, A. S. Manjunath, and M. A. Jayaram, "Comparative Study of Attribute Selection Using Gain Ratio and Correlation Based Feature Selection," Int. J. Inf. Technol. Knowl. Manag., vol. 2, no. 2, pp. 271–277, 2010.