This document summarizes a research project that used a naive Bayes classifier to analyze public comments on the Optional Practical Training (OPT) program. The researchers collected over 42,000 comments from an online Department of Homeland Security forum and labeled 900 of them for training and testing a multinomial naive Bayes model. By further enhancing the model with an iterative classification maximization algorithm, they achieved 96.5% accuracy. The results showed that 85.17% of comments supported the OPT extension while 14.83% opposed it. An additional analysis by ethnicity found Chinese commenters to be more supportive than American commenters.
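The labeling-and-classification step described above can be sketched with scikit-learn. This is a minimal illustration only: the comments and labels below are invented stand-ins for the actual DHS forum data, and the iterative classification maximization enhancement is omitted.

```python
# Minimal multinomial naive Bayes sketch of the comment-stance task.
# The comments below are invented stand-ins for the labeled forum data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_comments = [
    "I fully support extending OPT for STEM students",
    "OPT extension helps international graduates contribute",
    "Please approve the OPT extension, it benefits everyone",
    "I oppose the OPT extension, it hurts local workers",
    "This extension should be rejected outright",
    "Do not extend OPT, end the program instead",
]
train_labels = ["support", "support", "support", "oppose", "oppose", "oppose"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_comments, train_labels)

pred = model.predict(["I strongly support the extension"])[0]
```

In practice the 900 labeled comments would be split into train and test sets, and the remaining ~41,000 unlabeled comments classified by the trained model.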
WARRANTS GENERATIONS USING A LANGUAGE MODEL AND A MULTI-AGENT SYSTEM (ijnlc)
Each argument begins with a conclusion, which is followed by one or more premises supporting the conclusion. The warrant is a critical component of Toulmin's argument model; it explains why the premises support the claim. Despite its critical role in establishing the claim's veracity, it is frequently omitted or left implicit, leaving readers to infer it. We consider the problem of producing more diverse, high-quality warrants in response to a claim and its evidence. First, we employ BART [1] as a conditional sequence-to-sequence language model to guide the output generation process, fine-tuning the BART model on the ARCT dataset [2]. Second, we propose the Multi-Agent Network for Warrant Generation, a model for producing more diverse, high-quality warrants by combining Reinforcement Learning (RL) and Generative Adversarial Networks (GANs) with a mechanism of mutual awareness among agents. Our model generates a greater variety of warrants than the baseline models, and the experimental results validate the effectiveness of the proposed hybrid model for warrant generation.
Abstract: Traditional approaches to document classification need labelled data to construct reliable and accurate classifiers. Unfortunately, labelled data are rarely available and are often too costly to obtain. For a given learning task for which training data are unavailable, abundant labelled data may exist in a different but related domain. One would like to use that related labelled data as auxiliary information to accomplish the classification task in the target domain. Recently, the paradigm of transfer learning has been introduced to enable effective learning strategies when the auxiliary data obey a different probability distribution. A co-clustering based classification algorithm has previously been proposed to tackle cross-domain text classification. In this work, we extend the idea underlying this approach by making the latent semantic relationship between the two domains explicit. This goal is achieved with the use of Wikipedia. As a result, the pathway that allows propagating labels between the two domains captures not only common words but also semantic concepts based on the content of documents. We empirically demonstrate the efficacy of our semantic-based approach to cross-domain classification using a variety of real data.
Keywords: Classification, Clustering, Cross-domain Text Classification, Co-clustering, Labelled Data, Traditional Approaches.
Title: Co-Clustering For Cross-Domain Text Classification
Authors: Rayala Venkat, Mahanthi Kasaragadda
ISSN 2350-1022
International Journal of Recent Research in Mathematics Computer Science and Information Technology
Paper Publications
Optimisation towards Latent Dirichlet Allocation: Its Topic Number and Collap... (IJECEIAES)
Latent Dirichlet Allocation (LDA) is a probabilistic model for grouping hidden topics in documents given a predefined number of topics. If the number of topics K is chosen incorrectly, word correlation with topics is limited: too large or too small a K causes inaccuracies when grouping topics during the formation of training models. This study aims to determine the optimal number of corpus topics in the LDA method using maximum likelihood and the Minimum Description Length (MDL) approach. The experiments use Indonesian news articles with document counts of 25, 50, 90, and 600, containing 3898, 7760, 13005, and 4365 words respectively. The results show that the maximum likelihood and MDL approaches yield the same optimal number of topics, and that the optimal number of topics is influenced by the alpha and beta parameters. In addition, the number of documents does not affect computation time, but the number of words does; computation times for the four datasets are 2.9721, 6.49637, 13.2967, and 3.7152 seconds. The optimised number of LDA topics is then used in a classification model, and the experiment shows a highest average accuracy of 61% with alpha 0.1 and beta 0.001.
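The topic-number search described above can be sketched with scikit-learn's `LatentDirichletAllocation`, whose `score()` method returns an approximate log-likelihood. The tiny corpus and candidate K values below are invented for illustration; `doc_topic_prior` and `topic_word_prior` play the roles of alpha and beta.

```python
# Sketch: pick the number of LDA topics K that maximizes the approximate
# log-likelihood. The corpus is a toy stand-in for the news articles.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "stocks market trading shares economy",
    "economy inflation market bank rates",
    "football match goal league players",
    "players coach league season football",
    "election votes parliament policy government",
    "government policy election campaign votes",
]
X = CountVectorizer().fit_transform(docs)

best_k, best_ll = None, float("-inf")
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(
        n_components=k,
        doc_topic_prior=0.1,     # alpha, as in the paper's best setting
        topic_word_prior=0.001,  # beta
        random_state=0,
    )
    lda.fit(X)
    ll = lda.score(X)  # approximate log-likelihood of the corpus
    if ll > best_ll:
        best_k, best_ll = k, ll
```

A proper MDL criterion would additionally penalize model complexity; here only the likelihood term is shown.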
A hybrid naïve Bayes based on similarity measure to optimize the mixed-data c... (TELKOMNIKA JOURNAL)
In this paper, a hybrid method is introduced to improve the classification performance of naïve Bayes (NB) on mixed datasets and multi-class problems. The proposed method relies on a similarity measure applied to the portions of the data that are not correctly classified by NB. Since the data contain multi-valued short text with rare words that limit NB performance, we employ an adapted selective classifier based on similarities (CSBS) to overcome the NB limitations and include the rare words in the computation. This is achieved by transforming the formula from a product of the probabilities of the categorical variable into a sum weighted by the numerical variable. The proposed algorithm was evaluated on card payment transaction data containing the transaction label (a multi-valued short text) and the transaction amount. Based on K-fold cross-validation, the evaluation results confirm that the proposed method achieves better precision, recall, and F-score than the NB and CSBS classifiers taken separately. Moreover, converting the product form to a sum gives rare words more influence on the text classification, which is another advantage of the proposed method.
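The product-to-sum idea can be illustrated with a toy scoring function. The probabilities and the weighting scheme below are invented simplifications, not the paper's exact formulas; they only show why a sum is kinder to rare words than a product.

```python
# Illustration of the product-to-sum idea: an NB-style score multiplies
# per-word probabilities, so an unseen (rare) word drives the whole
# product toward zero; a sum weighted by a numerical variable (here, a
# transaction amount) lets rare words still contribute.
# All probabilities below are invented toy values.
word_prob = {"coffee": 0.30, "shop": 0.25}  # P(word | class)

def product_score(words):
    score = 1.0
    for w in words:
        score *= word_prob.get(w, 1e-9)  # a rare word crushes the product
    return score

def weighted_sum_score(words, amount):
    # rare words get a small default weight instead of annihilating the score
    return sum(amount * word_prob.get(w, 0.01) for w in words)

label = ["coffee", "shop", "latte"]  # "latte" is a rare word
prod = product_score(label)
summ = weighted_sum_score(label, amount=4.5)
```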
An in-depth exploration of Bangla blog post classification (journalBEEI)
The Bangla blogosphere is growing rapidly in the information era, and consequently blogs have diverse layouts and categorization. In this situation, automated blog post classification is a comparatively efficient way to organize Bangla blog posts in a standard manner so that users can easily find the articles they need. In this research, nine supervised learning models are utilized and compared for classification of Bangla blog posts: Support Vector Machine (SVM), multinomial naïve Bayes (MNB), multi-layer perceptron (MLP), k-nearest neighbours (k-NN), stochastic gradient descent (SGD), decision tree, perceptron, ridge classifier, and random forest. To measure performance on predicting blog posts across eight categories, three feature extraction techniques are applied: unigram TF-IDF (term frequency-inverse document frequency), bigram TF-IDF, and trigram TF-IDF. The majority of the classifiers achieve above 80% accuracy, and other performance evaluation metrics also show good results for the selected classifiers.
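A setup of this kind, n-gram TF-IDF features feeding several supervised classifiers, can be sketched with scikit-learn pipelines. The English toy texts below stand in for the Bangla blog posts, and only three of the nine models are shown.

```python
# Sketch of the comparison setup: unigram+bigram TF-IDF features feeding
# several classifiers. Toy English texts stand in for Bangla blog posts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "recipe cooking spices kitchen dinner",
    "cooking recipe ingredients taste dinner",
    "goal match striker football league",
    "league football match fans stadium",
]
labels = ["food", "food", "sports", "sports"]

scores = {}
for name, clf in [("MNB", MultinomialNB()),
                  ("Ridge", RidgeClassifier()),
                  ("SGD", SGDClassifier(random_state=0))]:
    pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipe.fit(texts, labels)
    scores[name] = pipe.score(texts, labels)  # training accuracy only
```

A real comparison would use a held-out test split per category rather than training accuracy.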
Novel text categorization by amalgamation of augmented k nearest neighbourhoo... (ijcsity)
Machine learning for text classification is the underpinning of document cataloging, news filtering, document steering, and exemplification. In the text mining realm, effective feature selection is significant to make the learning task more accurate and competent. One traditional lazy text classifier, k-Nearest Neighborhood (kNN), has a major pitfall: it calculates the similarity between all the objects in the training and testing sets, thereby exaggerating both the computational complexity of the algorithm and the consumption of main memory. To diminish these shortcomings from the viewpoint of a data-mining practitioner, an amalgamative technique is proposed in this paper, using a novel restructured version of kNN called Augmented kNN (AkNN) together with k-Medoids (kMdd) clustering. The proposed work preprocesses the initial training set by imposing attribute feature selection to reduce high dimensionality; it also detects and excludes the high-flier samples in the initial training set and restructures a constricted training set. The kMdd clustering algorithm generates the cluster centers (as interior objects) for each category and restructures the constricted training set with centroids. This technique is amalgamated with an AkNN classifier prearranged with text-mining similarity measures. Eventually, significant weights and ranks are assigned to each object in the new training set based upon their accessory towards the objects in the testing set. Experiments conducted on Reuters-21578, a UCI benchmark text mining dataset, and comparisons with the traditional kNN classifier indicate that the proposed method yields preeminent results in both clustering and classification.
Semantic annotation is done by first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations; the vectors are taken as features for a classifier, which is trained to produce a model that can label a document with ACM classification tree categories, with the help of a Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", ICML, 2014.
Document Classification Using KNN with Fuzzy Bags of Word Representations (suthi)
Abstract — Text classification classifies documents depending on words, phrases, and word combinations according to declared syntaxes. Many applications use text classification, such as artificial intelligence systems that maintain data according to category. Some keywords, called topics, are selected to classify a given document; using these topics, the main idea of the document can be identified. Selecting the topics is therefore an important task in classifying a document into its category. In the proposed system, keywords are extracted from documents using TF-IDF and WordNet: the TF-IDF algorithm selects the important words by which a document can be classified, and WordNet finds the similarity between these candidate words. The words with the maximum similarity are considered the topics (keywords). In this experiment we used the TF-IDF model to find similar words in order to classify documents. Although the decision tree algorithm gives better accuracy for text classification than other algorithms, we use a fuzzy system to classify text written in natural language according to topic. A fuzzy classifier is necessary for this task because a given text can cover several topics to different degrees; traditional classifiers are inappropriate in this context, as they attempt to sort each text into a single class in a winner-takes-all fashion. The classifier we propose automatically learns its fuzzy rules from training examples. We have applied it to classify news articles, and the results we obtained are promising. The dimensionality of a vector is very important in text classification, and we can decrease this dimensionality by using clustering based on fuzzy logic. Depending on similarity, we can classify documents and form them into clusters according to their topics.
After the clusters are formed, one can easily access and store the documents. In this way we can find the similarity and summarize the words, called topics, which can be used to classify the documents.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATION (ijaia)
This paper explores the use of machine learning approaches, or more specifically four supervised learning methods, namely Decision Tree (C4.5), K-Nearest Neighbour (KNN), Naïve Bayes (NB), and Support Vector Machine (SVM), for categorization of Bangla web documents. This is the task of automatically sorting a set of documents into categories from a predefined set. Whereas a wide range of methods has been applied to English text categorization, relatively few studies have been conducted on Bangla text categorization. Hence, we analyze the efficiency of these four methods for categorizing Bangla documents. For validation, a Bangla corpus was developed from various websites and used in the experiment. For Bangla, the empirical results show that all four methods produce satisfactory performance, with SVM attaining good results on high-dimensional and relatively noisy document feature vectors.
In this paper I first compare Single-Label Text Categorization with Multi-Label Text Categorization in detail, and then compare Document-Pivoted Categorization with Category-Pivoted Categorization in detail. For this purpose I give the general definition of Text Categorization with its mathematical notation, chosen for its frugality and cost effectiveness. With the help of mathematical notation and set theory, I convert the general definitions of Single-Label and Multi-Label Text Categorization into their respective mathematical representations, and then discuss Binary Text Categorization as a special case of Single-Label Text Categorization. After the comparison, I find that Single-Label (or Binary) Text Categorization is more general than Multi-Label Text Categorization. Thereafter I discuss an algorithm for transforming Multi-Label Classification into Binary Classification and explain the conditions under which this transformation holds. In the second step I compare Document-Pivoted Categorization with Category-Pivoted Categorization in detail, finding that Category-Pivoted Categorization is more typical and complex than Document-Pivoted Categorization; it becomes more complicated still when a new category is added to the predefined set and recurrent classification of documents takes place. Finally, I compare Hard Categorization with Ranking Categorization. Hard Categorization incorporates 'hard decisions' about the relevance or belonging of a document to a category; such a decision is either completely true or completely false. Ranking Categorization, by contrast, ranks a document's belonging to a category according to the estimated appropriateness of the category to the document; the final ranked list it produces is then used by a human expert for the final text categorization decision.
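The multi-label-to-binary transformation discussed above is commonly realized as "binary relevance": one independent binary (yes/no) classifier per category. A minimal sketch with invented data:

```python
# Binary-relevance sketch: transform a multi-label categorization task
# into one binary problem per category. Documents and labels are invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "election results and market reaction",
    "stock market rally continues",
    "parliament passes new election law",
    "football league results announced",
]
# multi-label assignments: a document may belong to several categories
labels = [{"politics", "finance"}, {"finance"}, {"politics"}, {"sports"}]
categories = ["politics", "finance", "sports"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

binary_classifiers = {}
for cat in categories:
    # binary target: does this document belong to category `cat`?
    y = [1 if cat in doc_labels else 0 for doc_labels in labels]
    binary_classifiers[cat] = LogisticRegression().fit(X, y)

# a new document receives every category whose binary classifier says yes
x_new = vec.transform(["market and election news"])
assigned = {cat for cat, clf in binary_classifiers.items()
            if clf.predict(x_new)[0] == 1}
```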
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE (ijnlc)
We propose an automatic movie-genre classification system based on different features of a film's textual synopsis. Our system is first trained on thousands of movie synopses from open online databases, learning relationships between textual signatures and movie genres. It is then tested on other movie synopses, and its results are compared to the true genres obtained from Wikipedia and the Open Movie Database (OMDB). The results show that our algorithm achieves a classification accuracy exceeding 75%.
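A majority-vote genre classifier of this general shape can be sketched with scikit-learn's `VotingClassifier`. The synopses below are invented stand-ins for the open-database training data, and the three base models are illustrative choices, not the paper's.

```python
# Sketch of a majority-vote text classifier: three base models vote on
# the genre of a synopsis. Synopses are invented stand-ins.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

synopses = [
    "a detective hunts a killer through the city",
    "the murder investigation leads to a shocking suspect",
    "two friends fall in love during a summer holiday",
    "a romantic wedding brings old lovers back together",
]
genres = ["thriller", "thriller", "romance", "romance"]

voter = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("nb", MultinomialNB()),
                    ("lr", LogisticRegression()),
                    ("dt", DecisionTreeClassifier(random_state=0))],
        voting="hard",  # majority vote over the three predictions
    ),
)
voter.fit(synopses, genres)
pred = voter.predict(["the killer strikes again in the city"])[0]
```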
Sentiment analysis using naive Bayes classifier (Dev Sahu)
This presentation contains a short description of the naive Bayes classifier algorithm, a machine learning approach for sentiment detection and text classification.
Spam filtering poses a critical problem in text categorization because the features of text are continuously changing. Spam evolves continuously, making it difficult for a filter to classify the evolving and evading new feature patterns. Since most practical applications are based on online user feedback, the task calls for fast, incremental, and robust learning algorithms. This paper presents a system for automatic detection and filtering of unsolicited electronic messages. We developed a content-based classifier that uses two topic models, LSI and PLSA, complemented by a text pattern-matching based natural language approach. By combining these statistical and NLP techniques we obtain a parallel content-based spam filter that performs filtration in two stages: in the first stage each model generates its individual predictions, which are combined by a voting mechanism in the second stage.
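The two-stage design can be sketched as follows. Everything here is an invented simplification: `TruncatedSVD` over TF-IDF stands in for LSI, PLSA is omitted, and a crude phrase matcher stands in for the NLP component.

```python
# Two-stage sketch of a parallel spam filter: stage one, each model makes
# its own prediction (an LSI-style classifier and a phrase matcher stand
# in for the paper's LSI/PLSA + NLP components); stage two, a vote
# combines them. All data and patterns are invented.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

mails = [
    "win free money now claim prize",
    "free prize winner click now",
    "meeting agenda attached for monday",
    "project report draft for review",
]
y = [1, 1, 0, 0]  # 1 = spam, 0 = ham

# model A: TF-IDF -> truncated SVD (an LSI stand-in) -> classifier
lsi_clf = make_pipeline(TfidfVectorizer(),
                        TruncatedSVD(n_components=2, random_state=0),
                        LogisticRegression())
lsi_clf.fit(mails, y)

# model B: naive pattern matcher over known spammy phrases
SPAM_PATTERNS = ("free money", "claim prize", "click now")
def pattern_vote(text):
    return 1 if any(p in text for p in SPAM_PATTERNS) else 0

def is_spam(text):
    # stage two: combine the stage-one votes (flag if either says spam)
    votes = int(lsi_clf.predict([text])[0]) + pattern_vote(text)
    return votes >= 1

flag = is_spam("claim your free money today")
```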
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING (IJDKP)
Fuzzy logic deals with degrees of truth. In this paper, we show how to apply fuzzy logic in text mining in order to perform document clustering, taking an example in which documents had to be clustered into two categories. The method involves cleaning up the text and stemming the words. Then, we choose 'm' features that differ significantly in their word frequencies (WF), normalized by document length, between documents belonging to the two clusters. The documents to be clustered are represented as collections of 'm' normalized WF values, and the Fuzzy c-means (FCM) algorithm is used to cluster them into two clusters. After the FCM execution finishes, the documents in the two clusters are analysed for the values of their respective 'm' features. It is known that documents belonging to a document type 'X' tend to have higher WF values for some particular features; if the documents in a cluster have higher WF values for those same features, that cluster is said to represent 'X'. With fuzzy logic, we obtain not only the cluster name but also the degree to which a document belongs to a cluster.
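The FCM step can be sketched in plain NumPy. The word-frequency features below are invented, and this is a textbook FCM update rule, not the paper's exact configuration.

```python
# Minimal fuzzy c-means sketch: cluster documents represented by 'm'
# normalized word-frequency features into two clusters, keeping each
# document's degree of membership. Data are invented.
import numpy as np

rng = np.random.default_rng(0)
# rows = documents, cols = normalized word-frequency features
X = np.array([
    [0.9, 0.1, 0.8],   # type "X": high WF on features 0 and 2
    [0.8, 0.2, 0.9],
    [0.1, 0.9, 0.2],   # the other document type
    [0.2, 0.8, 0.1],
])
n, c, m = X.shape[0], 2, 2.0  # documents, clusters, fuzzifier

U = rng.random((n, c))
U /= U.sum(axis=1, keepdims=True)  # memberships sum to 1 per document

for _ in range(100):
    Um = U ** m
    # weighted cluster centers
    centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
    # distance of every document to every center (epsilon avoids /0)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
    # standard FCM membership update: u_ik ∝ d_ik^(-2/(m-1))
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)

memberships = U  # memberships[i, k] = degree document i belongs to cluster k
```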
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING (ijnlc)
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles using semantic features and various machine learning techniques. We investigated recurrent neural networks against naive Bayes and random forest classifiers using five groups of linguistic features. Evaluated on the real-or-fake dataset from kaggle.com, the best-performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams shows that word pairs, as opposed to single words or longer phrases, best indicate the authenticity of news.
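The best-performing setup above, bigram features with a random forest, can be sketched in scikit-learn. The headlines below are invented stand-ins for the Kaggle articles.

```python
# Sketch of bigram features + random forest for fake-news detection.
# The headlines are invented stand-ins for the real-or-fake dataset.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

articles = [
    "scientists confirm study results after peer review",
    "official report confirms economic growth figures",
    "shocking secret cure they dont want you to know",
    "you wont believe this miracle trick doctors hate",
]
labels = ["real", "real", "fake", "fake"]

model = make_pipeline(
    CountVectorizer(ngram_range=(2, 2)),  # bigram features only
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(articles, labels)
pred = model.predict(["peer review confirms study results"])[0]
```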
Novel text categorization by amalgamation of augmented k nearest neighbourhoo...ijcsity
Machine learning for text classification is the
underpinning
of document
cataloging
, news filtering,
document
steering
and
exemplif
ication
. In text mining realm, effective feature selection is significant to
make the learning task more accurate and competent. One of the
traditional
lazy
text classifier
k
-
Nearest
Neighborhood (
k
NN) has
a
major pitfall in calculating the similarity between
all
the
objects in training and
testing se
t
s,
there by leads to exaggeration of
both
computational complexity
of the algorithm
and
massive
consumption
of
main memory
. To diminish these shortcomings
in
viewpoint
of a
data
-
mining
practitioner
a
n
amalgamati
ve technique is proposed in this paper using
a novel restructured version of
k
NN called
Augmented
k
NN
(AkNN)
and
k
-
Medoids
(kMdd)
clustering.
The proposed work
comprises
preprocesses
on
the
initial training
set
by
imposing
attribute feature selection
for reduc
tion of high dimensionality, also it
detects and excludes the high
-
fliers
samples
in t
he
initial
training set
and
re
structure
s
a
constricted
training
set
.
The kMdd clustering algorithm generates the cluster centers (as interior objects) for each category
and
restructures
the constricted training set
with centroids
. This technique
is
amalgamated with
AkNN
classifier
that
was prearranged with
text mining similarity measure
s.
Eventually, s
ignifican
tweights
and ranks were
assigned to each object in the new
training set based upon the
ir
accessory towards the
object in testing set
.
Experiments
conducted
on Reuters
-
21578 a
UCI benchmark
text mining
data
set
, and
comparisons with
traditional
k
NN
classifier designates
the
referred
method
yield
spreeminentrecital
in b
oth clustering and
classification
Semantic annotation is done through first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations, the vectors are taken as features into a classifier, trained and a model is made which can classify a document with ACM classification tree categories, with the help of Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le, and Tomas Mikolov, ''Distributed Representations of Sentences and Documents ICML", 2014
Document Classification Using KNN with Fuzzy Bags of Word Representationsuthi
Abstract — Text classification is used to classify the documents depending on the words, phrases and word combinations according to the declared syntaxes. There are many applications that are using text classification such as artificial intelligence, to maintain the data according to the category and in many other. Some keywords which are called topics are selected to classify the given document. Using these Topics the main idea of the document can be identified. Selecting the Topics is an important task to classify the document according to the category. In this proposed system keywords are extracted from documents using TF-IDF and Word Net. TF-IDF algorithm is mainly used to select the important words by which document can be classified. Word Net is mainly used to find similarity between these candidate words. The words which are having the maximum similarity are considered as Topics(keywords). In this experiment we used TF-IDF model to find the similar words so that to classify the document. Decision tree algorithm gives the better accuracy for text classification when compared to other algorithms fuzzy system to classify text written in natural language according to topic. It is necessary to use a fuzzy classifier for this task, due to the fact that a given text can cover several topics with different degrees. In this context, traditional classifiers are inappropriate, as they attempt to sort each text in a single class in a winner-takes-all fashion. The classifier we proposeautomatically learns its fuzzy rules from training examples. We have applied it to classify news articles, and the results we obtained are promising. The dimensionality of a vector is very important in text classification. We can decrease this dimensionality by using clustering based on fuzzy logic. Depending on the similarity we can classify the document and thus they can be formed into clusters according to their Topics. 
After formation of clusters one can easily access the documents and save the documents very easily. In this we can find the similarity and summarize the words called Topics which can be used to classify the Documents.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
This paper explores the use of machine learning approaches, or more specifically, four supervised learning
Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Naïve Bays (NB), and Support Vector
Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of
documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce
satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy
document feature vectors.
In this paper I first compare Single Label Text Categorization with Multi Label Text Categorization in detail, and then compare Document Pivoted Categorization with Category Pivoted Categorization. For this purpose I give the general definition of Text Categorization with its mathematical notation, for the sake of concision and cost effectiveness. With the help of mathematical notation and set theory, I convert the general definitions of Single Label and Multi Label Text Categorization into their respective mathematical representations, and then discuss Binary Text Categorization as a special case of Single Label Text Categorization. After comparing Single Label with Multi Label Text Categorization, I find that Single Label (Binary) Text Categorization is more general than Multi Label Text Categorization. Thereafter I discuss an algorithm for transforming Multi Label Classification into Binary Classification and explain the conditions under which this transformation holds. In the second step I compare Document Pivoted Categorization with Category Pivoted Categorization in detail, and find that Category Pivoted Categorization is more involved and complex than Document Pivoted Categorization; it becomes more complicated still when a new category is added to the predefined set of categories and recurrent classification of documents takes place. Finally I compare Hard Categorization with Ranking Categorization. Hard Categorization incorporates 'hard decisions' about the relevance or belonging of a document to a category: the decision is either completely true or completely false. Ranking Categorization, in contrast, ranks the belonging of a document to a category according to the estimated appropriateness of the document. The final ranked list produced by Ranking Categorization is used by a human expert for the final text categorization decision.
A FILM SYNOPSIS GENRE CLASSIFIER BASED ON MAJORITY VOTE (IJNLC)
We propose an automatic classification system for movie genres based on different features from their textual synopses. Our system is first trained on thousands of movie synopses from open online databases, by learning relationships between textual signatures and movie genres. It is then tested on other movie synopses, and its results are compared to the true genres obtained from the Wikipedia and Open Movie Database (OMDB) databases. The results show that our algorithm achieves a classification accuracy exceeding 75%.
Sentiment analysis using naive Bayes classifier (Dev Sahu)
This presentation contains a short description of the naive Bayes classifier algorithm, a machine learning approach for sentiment detection and text classification.
Spam filtering poses a critical problem in text categorization, as the features of text are continuously changing. Spam evolves continuously, making it difficult for a filter to classify the evolving and evasive new feature patterns. Since most practical applications are based on online user feedback, the task calls for fast, incremental and robust learning algorithms. This paper presents a system for automatic detection and filtering of unsolicited electronic messages. We have developed a content-based classifier which uses two topic models, LSI and PLSA, complemented with a text pattern-matching based natural language approach. By combining these powerful statistical and NLP techniques we obtained a parallel content-based spam filter which performs the filtration in two stages: in the first stage each model generates its individual predictions, which are then combined by a voting mechanism in the second stage.
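The two-stage combination described above can be sketched abstractly. This hypothetical snippet assumes three per-message spam/ham votes (for example, from the LSI model, the PLSA model and the pattern matcher) and combines them by simple majority; the exact combination rule used in the paper is not specified.

```python
def majority_vote(predictions):
    """predictions: list of 0/1 labels (1 = spam) from individual models.
    Returns 1 (spam) when a strict majority of models vote spam."""
    return int(sum(predictions) > len(predictions) / 2)

# e.g. LSI says spam, PLSA says spam, pattern matcher says ham
label = majority_vote([1, 1, 0])
```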
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING (IJDKP)
Fuzzy logic deals with degrees of truth. In this paper, we have shown how to apply fuzzy logic in text
mining in order to perform document clustering. We took an example of document clustering where the
documents had to be clustered into two categories. The method involved cleaning up the text and stemming
of words. Then, we chose ‘m’ features which differ significantly in their word frequencies (WF), normalized
by document length, between documents belonging to these two clusters. The documents to be clustered
were represented as a collection of ‘m’ normalized WF values. Fuzzy c-means (FCM) algorithm was used
to cluster these documents into two clusters. After the FCM execution finished, the documents in the two
clusters were analysed for the values of their respective ‘m’ features. It was known that documents
belonging to a document type ‘X’ tend to have higher WF values for some particular features. If the
documents belonging to a cluster had higher WF values for those same features, then that cluster was said
to represent ‘X’. By fuzzy logic, we not only get the cluster name, but also the degree to which a document belongs to a cluster.
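The clustering step can be sketched with a minimal fuzzy c-means implementation, assuming documents have already been reduced to 'm' normalized word-frequency features (here m = 2 on a hand-made toy matrix). The update rules below follow the standard FCM algorithm; they are an illustration, not the paper's exact implementation.

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means; returns the membership matrix U (n x c)."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)              # memberships sum to 1
    p = 2.0 / (m - 1.0)
    for _ in range(iters):
        W = U ** m
        # weighted cluster centers
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # distance of every document to every cluster center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        # standard FCM membership update: u_ij = 1 / sum_k (d_ij/d_ik)^p
        U = 1.0 / ((d ** p) * (d ** -p).sum(axis=1, keepdims=True))
    return U

# two clearly separated groups of normalized word-frequency vectors
X = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
U = fuzzy_cmeans(X)
```

Each row of U gives the degree to which that document belongs to each of the two clusters, which is exactly the soft assignment the abstract describes.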
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING (IJNLC)
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles through the use of semantic features and various machine learning techniques. In this research, we compared recurrent neural networks with naive Bayes and random forest classifiers using five groups of linguistic features. Evaluated on the "real or fake" dataset from kaggle.com, the best-performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams shows that word pairs, as opposed to single words or phrases, best indicate the authenticity of news.
Supervised WSD Using Master-Slave Voting Technique (IOSR-JCE)
USING ONTOLOGIES TO IMPROVE DOCUMENT CLASSIFICATION WITH TRANSDUCTIVE SUPPORT... (IJDKP)
Many applications of automatic document classification require learning accurately with little training data. The semi-supervised classification technique uses labeled and unlabeled data for training. This technique has been shown to be effective in some cases; however, the use of unlabeled data is not always beneficial.
On the other hand, the emergence of web technologies has given rise to the collaborative development of ontologies. In this paper, we propose the use of ontologies to improve the accuracy and efficiency of semi-supervised document classification.
We used support vector machines, one of the most effective algorithms studied for text. Our algorithm enhances the performance of transductive support vector machines through the use of ontologies. We report experimental results applying our algorithm to three different datasets. Our experiments show an accuracy improvement of 4% on average, and of up to 20%, in comparison with the traditional semi-supervised model.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION (IJDKP)
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very similar in content and related to the telecommunications, Internet and computer areas were selected for the model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
DIFFERENCE OF PROBABILITY AND INFORMATION ENTROPY FOR SKILLS CLASSIFICATION A... (IJAIA)
The probability of an event lies in the range [0, 1]. In a sample space S, the value of the probability determines whether an outcome is true or false: the probability of an event that will never occur is Pr(A) = 0, and the probability of an event that will certainly occur is Pr(B) = 1, making both events A and B certainties. Furthermore, the sum of probabilities Pr(E1) + Pr(E2) + … + Pr(En) of a finite set of events in a given sample space S is 1. Conversely, the difference of the sum of two probabilities that will certainly occur is 0. This paper first discusses Bayes' theorem, then the complement of probability and the difference of probability for occurrences of learning events, before applying these to the prediction of learning objects in student learning. Given the sum total of 1, to make recommendations for student learning, this paper submits that the difference between argMax Pr(S) and the probability of student performance quantifies the weight of learning objects for students. Using a dataset of skill sets, the computational procedure demonstrates: i) the probability of skill-set events that have occurred and would lead to higher-level learning; ii) the probability of events that have not occurred and require subject-matter relearning; iii) the accuracy of a decision tree in predicting student performance into class labels; and iv) the information entropy of the skill-set data and its implications for student cognitive performance and the recommendation of learning [1].
This paper presents a review and comparative evaluation of several well-known machine learning algorithms in terms of their suitability and code performance on a given data set of any size. We describe our Machine Learning ToolBox, built using the Python programming language. The algorithms in the toolbox consist of supervised classification algorithms such as Naïve Bayes, Decision Trees, SVM, K-Nearest Neighbours and Neural Networks (backpropagation). The algorithms are tested on the iris and diabetes datasets and are compared on the basis of their accuracy under different conditions; using our tool, however, one can apply any of the implemented ML algorithms to any dataset of any size. The main goal of building the toolbox is to provide users with a platform to test their datasets on different machine learning algorithms and use the accuracy results to determine which algorithm fits the data best. The toolbox allows the user to choose a dataset of his or her choice, in either structured or unstructured form, and then choose the features to use for training the machine. We give concluding remarks on the performance of the implemented algorithms based on experimental analysis.
Implementation of Naive Bayesian Classifier and Ada-Boost Algorithm Using Mai... (IJIST)
Machine learning [1] is concerned with the design and development of algorithms that allow computers to evolve intelligent behaviour based on empirical data. A weak learner is a learning algorithm whose accuracy is only slightly better than chance. Adaptive Boosting (Ada-Boost) is a machine learning algorithm that may be used to increase the accuracy of any weak learning algorithm, by running the weak learner several times on slightly altered data and combining the resulting hypotheses. In this paper, the Ada-Boost algorithm is used to increase the accuracy of the weak Naïve-Bayesian classifier. The Ada-Boost algorithm iteratively works on the Naïve-Bayesian classifier with normalized weights and classifies the given input into different classes with some attributes. A Maize Expert System is developed to identify diseases of the maize crop, using Ada-Boost logic as the inference mechanism. The user interface of the Maize Expert System consists of three modules, namely End-user/farmer, Expert and Admin. The End-user/farmer module may be used to identify diseases from the symptoms entered by the farmer; the Expert module may be used by a domain expert to add rules and questions to the data set; and the Admin module may be used for maintenance of the system.
A New Active Learning Technique Using Furthest Nearest Neighbour Criterion fo... (IJCSA)
Active learning is a supervised learning method based on the idea that a machine learning algorithm can achieve greater accuracy with fewer labelled training images if it is allowed to choose the images from which it learns. Facial age classification is a technique for classifying face images into one of several predefined age groups. The proposed study applies an active learning approach to facial age classification, allowing the classifier to select the data from which it learns. The classifier is initially trained using a small pool of labeled training images, via bilateral two-dimensional linear discriminant analysis. The most informative unlabeled image is then identified in the unlabeled pool using the furthest-nearest-neighbour criterion, labeled by the user, and added to the appropriate class in the training set. Incremental learning is performed using an incremental version of bilateral two-dimensional linear discriminant analysis. This active learning paradigm is applied to the k-nearest-neighbour classifier and the support vector machine classifier, and the performance of the two classifiers is compared.
Classification of Optional Practical Training (OPT) comments using a Naive Bayes classifier
Anand (a3anand@ucsd.edu), Sampath Krishna (svelaga@ucsd.edu), Jorge A. Garza Guardado (jgarzagu@ucsd.edu), Adithya Apuroop K (akaravad@ucsd.edu)
ABSTRACT
This project aims to classify optional practical training comments using a naive Bayes classifier. We demonstrate the effectiveness of the naive Bayes approach and further enhance its performance using a simplified form of an expectation maximisation algorithm. We explore how sentiments change over time, and also provide preliminary results that help in understanding how sentiments vary with ethnicity.
1. INTRODUCTION
OPT is a scheme in which students with F-1 visas are permitted by the United States Citizenship and Immigration Services (USCIS) to work for at most one year on a student visa, obtaining practical training to complement their field of study. On April 2, 2008, the Department of Homeland Security (DHS) announced an extension to the OPT, which was passed by USCIS as an interim rule. This rule allows students in Science, Technology, Engineering or Math (STEM) majors an OPT extension of up to 17 months.
In August 2015, a US federal court gave its verdict on a lawsuit challenging the 17-month OPT STEM extension. The court decided that the interim rule was deficient, as it had not been subjected to public notice, comments and opinions, and it vacated the 2008 rule allowing the 17-month extension. However, a stay was put in place until February 12, 2016; DHS has until then to take action regarding the fate of the STEM extension program. The rule was open to public comments for a one-month duration ending on Nov 18th. The comments are publicly available at [1].
2. THE DATA SET
2.1 Data collection
Data was collected from the Department of Homeland Security (DHS) web page forum [1], containing at the time 42,925 comments. This data was obtained over a period of 30 days, ranging from 19th October to 18th November. The DHS web page provides a CSV file containing all user names and comments, but each comment is stored as a web page link. A script was written to download and then parse each web page containing the comment for each user, and the resulting data was stored in a JSON file. The data we used contained the fields: ‘userName’, ‘comment’, ‘docID’, ‘receivedDate’, ‘postedDate’.
2.2 Dataset Preprocessing
As a pre-processing step, we removed all punctuation from the words. We also changed all words to lower case, although a more rigorous model could use the upper-case information to identify stronger sentiments. Finally, all the common stop words were removed, as they convey little meaning.
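The preprocessing steps above can be sketched as follows; the stop-word list here is an illustrative subset, not the one actually used.

```python
import re

# illustrative subset of common English stop words
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}

def preprocess(comment):
    """Strip punctuation, lower-case, and drop common stop words."""
    comment = re.sub(r"[^\w\s]", " ", comment).lower()
    return [w for w in comment.split() if w not in STOP_WORDS]

tokens = preprocess("I support the OPT extension; it benefits the economy!")
```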
2.3 Dataset Labeling
Since the original dataset is unlabeled, we manually labeled the first 900 comments as support or oppose. Of these, the first 600 were used for training, comments 601-700 constituted the validation set, and comments 701-900 were used for testing. We used the validation set to pick the best model from a pool of candidate models.
2.4 Data visualisation/Exploratory analysis
Figures 1 and 2 contain the word clouds of the most common words (after removing stop words) in the training data for positive and negative labels. Some of the most commonly found words in supporting comments were {benefit, support, economy, STEM, international, students, good}, suggesting that people supporting the OPT extension feel it will benefit the economy and is good for international students. In opposing comments we found words like {American, job, worker, student, foreign, program}, suggesting that people opposing it are concerned about jobs being taken away by foreign students.
Figure 1: Positive comments word cloud
Figure 2: Negative comments word cloud
The distribution of word frequencies in documents is believed to follow Zipf's law, which suggests that the frequency of each word is close to inversely proportional to its rank in the frequency table raised to a power a, where a ≈ 1. We observe behavior consistent with this hypothesis in the plot in Fig. 3:

P ∝ 1/n^a
Figure 3: Zipf's law vs frequency of words in the dataset
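A quick way to check the Zipf hypothesis is to fit the slope of log-frequency against log-rank: under P ∝ 1/n^a the slope is -a. The sketch below uses synthetic rank-frequency data generated with a = 1 (not the actual comment corpus) and recovers the exponent by least squares.

```python
import math

# synthetic rank-frequency table following Zipf's law with a = 1: f(r) = 1000 / r
ranks = range(1, 21)
freqs = [1000 // r for r in ranks]

# least-squares slope of log f versus log r; under Zipf's law the slope is -a
pts = [(math.log(r), math.log(f)) for r, f in zip(ranks, freqs)]
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
a_hat = -slope
```

On real word counts, `freqs` would come from sorting a word-frequency table in decreasing order.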
2.5 Predictive task
Our main goal is to classify whether a given comment is supporting or opposing. In addition, based on the classifier we obtained, we also examined how the proportions of supporting and opposing reviews varied with time. Finally, we tried to examine trends on an ethnicity basis. The main idea is to check whether the voting pattern supports the hypothesis that most Americans oppose the OPT extension, while people of other ethnicities support it.
3. PREVIOUS WORK
Since the data set is recent and relatively small, we didn't find any literature specific to it except for [2], which serves as the motivation for our analysis in the first place. We read survey papers on natural language processing and algorithms for sentiment detection in similar data sets, and explored the algorithms in [7], [9], [10] for text classification. Apart from the naive Bayes and iterative maximisation approach, we came across other interesting algorithms for sentiment classification, such as decision trees, artificial neural networks and support vector machines. Decision trees are generally prone to over-fitting the training data and perform well when a lot of labeled training data is available. We decided to try naive Bayes and clustering as performant classification techniques. We would have liked to explore support vector machines and neural networks as well, but time constraints led us to focus our analysis on these two approaches.
4. ALGORITHMS AND MODELS FOR CLASSIFICATION
Broadly speaking, there are two classes of algorithms that could be tried to classify the comment labels: supervised and unsupervised. For unsupervised learning, we tried clustering based on the tf-idf features extracted from the text with the Euclidean distance metric.
Hierarchical clustering runs in time O(n³d), where n is the number of data points and d is the number of dimensions of the feature vector, making it very slow for large data sets. Therefore, we implemented K-means clustering, which is much faster. However, no useful clusters were identified, and the accuracies were no better than those of a random classifier. This is to be expected because there is no coherent structure across the different comments: they are of varying lengths and contain different kinds of vocabulary to express the same sentiment, rendering Euclidean distance a very poor distance measure.
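The tf-idf plus K-means pipeline we tried can be sketched as follows, here on hypothetical toy comments with scikit-learn; as noted above, this approach did not yield useful clusters on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

comments = [
    "I support the OPT extension, it benefits the economy",
    "OPT helps international students gain experience",
    "I oppose this program, American workers lose jobs",
    "This rule takes jobs away from American workers",
]

# tf-idf features clustered with Euclidean-distance k-means (k = 2)
X = TfidfVectorizer(stop_words="english").fit_transform(comments)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```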
Naive Bayes performs particularly well for text classification despite the aggressive independence assumption it makes. The reason is thought to be that, although naive Bayes fails to produce good estimates of the probabilities, we do not require their absolute values, only their relative ordering, to obtain the MAP estimate. Reports in [2] suggested that naive Bayes indeed performs well on this data set. There are at least two popular versions of naive Bayes: multinomial and Bernoulli.
Bernoulli naive Bayes assumes that each document belonging to a class contains occurrences of words described by the probability distribution of the words belonging to that class. The probability of the document given the class can then be modeled by:

P(doc|class) = ∏_{unique w ∈ doc} P(w|class) · ∏_{w ∉ doc} (1 − P(w|class))
On the other hand, multinomial naive Bayes assumes that a document of a particular class is generated by the following generative process: first, the length is chosen according to some distribution (which we do not care about, as the length does not depend on the class labels); then, every word in the document is generated by a multinomial distribution over the words belonging to that class. In this case, the corresponding probability can be modeled by:

P(doc|class) = P(length(doc)) · ∏_{w ∈ doc} P(w|class)
We implemented both multinomial and Bernoulli naive Bayes, but we explored only the multinomial version because it was faster while giving results similar to the Bernoulli version.
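A minimal multinomial naive Bayes sketch in this spirit, using scikit-learn on hypothetical toy comments (labels: 1 = support, 0 = oppose); passing ngram_range=(1, 2) to the vectorizer would give the unigram+bigram variant discussed below.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_comments = [
    "support extension benefits economy students",
    "great for international students and economy",
    "oppose foreign workers taking american jobs",
    "american workers lose jobs to this program",
]
train_labels = [1, 1, 0, 0]  # 1 = support, 0 = oppose

# unigram bag-of-words counts feed the multinomial model
vec = CountVectorizer()
X = vec.fit_transform(train_comments)
clf = MultinomialNB().fit(X, train_labels)

pred = clf.predict(vec.transform(["students benefit the economy"]))[0]
```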
Multinomial models using just unigrams, just bigrams, and both unigrams and bigrams were considered. Initial results showing the performance on the training and validation sets can be seen in Table 1.
Model            Train acc   Validation acc
Unigram          98%         86%
Bigram           99.5%       86%
Unigram+Bigram   99.33%      88%

Table 1: Training and validation accuracies of the three schemes being considered

From Table 1, we observe that the bigram-only model possibly overfits; therefore we only consider the unigram-only model and the unigram+bigram model. While naive Bayes in itself performs reasonably well, its performance can be boosted by augmenting it with a simple fix.
4.1 Semi-supervised estimation
It has been suggested by the authors of [3] that, in cases where the number of training examples is small, the performance of the naive Bayes classifier can be improved by combining it with an expectation maximization algorithm. In short, the authors suggest:

1. Predict the class probabilities P(class|data) for all examples in the dataset.
2. Retrain the model based on the class probabilities estimated in the previous step.

The first step above is an expectation step in disguise, and the second step corresponds to the maximisation. Although the second step requires retraining the model based on the probabilities from the previous step, we use a relaxed version:

1. Predict the class probabilities P(class|data) for all examples in the dataset.
2. Retrain the model based on the class labels estimated in the previous step.

This algorithm, which we'll refer to as the classification maximisation (CM) algorithm, is a convenient approximation to the more rigorous expectation maximization. In other words, we use the predicted labels as the actual labels and retrain the model based on these labels until convergence.
These iterations significantly improve the accuracy of the naive Bayes model by incorporating knowledge from the large pool of unlabeled examples. Refer to Table 2 for the performance comparison of the classification maximisation and naive Bayes algorithms. Note that the classification maximisation algorithm achieves significantly better TPR and TNR on the test set than naive Bayes. TNR is a particularly important metric for evaluating this classifier, as negative examples are relatively rare and one would want to classify them correctly; after all, an "all positive" classifier would achieve an accuracy of about 85% on this dataset.
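One reading of the CM loop, sketched on a hypothetical toy corpus: true labels are kept for the labeled examples, predicted labels are used for the unlabeled pool, and the model is retrained until the predicted labels stop changing. The corpus and convergence test are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# hypothetical tiny corpus: a few hand-labeled comments plus unlabeled ones
labeled = ["support extension economy", "students benefit economy",
           "oppose foreign workers", "american jobs lost"]
y = np.array([1, 1, 0, 0])                       # 1 = support, 0 = oppose
unlabeled = ["economy students support", "jobs workers oppose american"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled)
X_unl = vec.transform(unlabeled)

clf = MultinomialNB().fit(X_lab, y)
prev = None
for _ in range(20):
    pseudo = clf.predict(X_unl)                  # "classification" step
    if prev is not None and (pseudo == prev).all():
        break                                    # labels stopped changing
    prev = pseudo
    # "maximisation" step: retrain on true + predicted labels
    clf = MultinomialNB().fit(vstack([X_lab, X_unl]),
                              np.concatenate([y, pseudo]))
```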
5. RESULTS AND OBSERVATIONS
From the predictions of the classification maximisation algorithm, we find that approximately 85.17% of the users support the OPT extension while 14.83% oppose it.
5.1 Excerpts of comments from the labeled set
There are certain cases where our naive Bayes model fails to predict the sentiment correctly. Consider a false negative classification in our test set:

“OPT is helping to find better workers for the jobs, not simply give the jobs to foreigners."

The classifier recognises the words jobs and foreigners as predictive of negative sentiment but doesn't notice the negation in the original clause.

A complex viewpoint expressed via contrast and juxtaposition also stumps our classifier. Consider the comment:

“Admittedly, there are Americans who can not find a job. But there are also foreign students who can not find a job. The majority of US companies already give priorities to US workers. As a result, the unemployment percentage of international students is already higher than that of native Americans. It is unfair to say that more US workers can not find a job. We should compare the percentage instead of the absolute number."

which actually supports the OPT proposal but is predicted to be a disapproval, since the bag-of-words approach lumps jobs, Americans and workers with negative sentiment. The (+, -) log likelihood is (-307.6, -305.9), which indicates an edge case for our classifier.
There are a few false positives as well. Let us evaluate a straightforward negative comment which manages to fool the classifier:

“I oppose the extension of OPT. Schools, especially public schools, welcome foreign students because they pay high tuition. And then, with extension of OPT, they earn back what they invest and maybe much more. Who is the winner? Obviously, foreign students who get more than they invested, schools which get a lot of money and companies which get a lot of comparable cheap workers. Who is the loser? Obviously not the government, the working Americans are loser, the middle class is loser. They take risk of losing their jobs but they don't get any benefit from having more and more foreign students."

Here the classifier is incapable of parsing the topic sentence but counts phrases like welcome foreign students and lot of money towards a false positive prediction. The log likelihoods for the positive and negative predictions are -505 and -513 respectively. The word oppose is present in our list of words which appear only in negative comments, but its frequency, a weak 560, doesn't sway the negative probability sufficiently.
However, for a majority of comments, the bag-of-words approach combined with iterative maximisation works surprisingly well. We will next look at a few comments drawn at random from the unlabeled dataset and see how our classifier performs.

5.2 Excerpts of comments from the unlabeled set

Here are some comments from the unlabeled set. We mention the log likelihoods for the positive and negative predictions alongside each comment; a higher log likelihood implies a greater probability for the classification.
Model                   Train acc   Validation acc   Test acc   TNR     TPR
naive Bayes (Uni)       98%         86%              90.5%      40.7%   98.2%
naive Bayes (Uni+Bi)    99.3%       86%              88.5%      29.7%   97.7%
CM Algorithm (Uni)      96.33%      97%              95.5%      92.5%   96%
CM Algorithm (Uni+Bi)   96.5%       96%              96.5%      92.5%   97.1%

Table 2: Comparison of the various algorithms
Consider comments like

“Please stand up for American citizens and say NO to this travesty."

with a (+, -) log likelihood of (-67.3, -57.9), clearly classifying it as a negative comment. True positive comments like

“International students bring money, skills and jobs in USA. This rule is not taking away any job from us. In fact because of this rule more and more jobs are being created for American citizen with or without degree in STEM field."

show a wide margin between the positive/negative log likelihoods of -485.7 and -518.7 respectively. The comment

“I oppose the Department of Homeland Security's proposed rule that would expand the Optional Practical Training program. This expansion would allow U.S. tech companies to hire . . . a de facto shadow H-1B program, in violation of Congressional intent."

with (+, -) log likelihood = (-961.6, -819.0), is also classified correctly.
With a test accuracy of 0.965, the classification maximisation approach combined with naive Bayes performs competitively compared to other advanced techniques such as neural networks or support vector machines.
5.3 Other interesting observations
We also tried to analyse the data based on the ethnicity of the users. We were curious to know how people supported or opposed the OPT based on their country of origin. For this, we collected publicly available common first names and surnames of American and Chinese ethnicity. We couldn't get the corresponding data for India to perform such an analysis. The results are summarized in Fig. 4.

It was initially quite surprising that a significant fraction of Americans support the OPT. There might be several reasons for this. Firstly, the database of American names contains many foreign names as well, and this might have created conflicts with true foreigners who supported the extension. Secondly, not all American names are in the database, and that might have resulted in some conflicts too. Nevertheless, upon examination of some reviews, we found that many Americans were supportive of OPT due to the positive talent it brings to the country.
We ran the classifier on the entire corpus of around 42,000
comments and plotted the distribution of sentiment over
time. The graph (Fig. 5) suggests that the initial sentiment
was overwhelmingly positive, with negative comments
beginning to trickle in a week after the voting started.
Figure 4: Sentiment breakdown by ethnicity
Figure 5: Sentiment breakdown over time
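The sentiment-over-time aggregation behind a plot like Fig. 5 can be sketched as below. The (date, label) pairs are toy stand-ins for the classifier's predictions on the full corpus; bucketing by ISO week is our assumption about the plot's granularity.

```python
# Sketch of the sentiment-over-time aggregation: classifier predictions
# are bucketed by ISO week and counted per label. The (date, label)
# pairs below are toy stand-ins for the real 42,000 predictions.
from collections import Counter
from datetime import date

predictions = [
    (date(2015, 10, 19), "pos"),
    (date(2015, 10, 20), "pos"),
    (date(2015, 10, 26), "neg"),
]
weekly = Counter((d.isocalendar()[1], label) for d, label in predictions)
for (week, label), n in sorted(weekly.items()):
    print(f"week {week}: {label} = {n}")
```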
An interesting result came from extracting the most negative
words, i.e. those that rarely or never occur in positive
comments. To do so, the entire dataset was segregated based
on the predicted labels, and each word's normalized positive
frequency was subtracted from its normalized negative
frequency. If a negative word also occurred many times in
the positive comments, it was removed. The resulting words
can be seen in Fig. 6.
Figure 6: Negative words which do not appear as
much in positive comments
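The word-ranking procedure just described can be sketched as below. This is an assumed reconstruction, not the authors' code; the cutoff value, the whitespace tokenization, and the toy comments are all our choices for illustration.

```python
# Sketch (assumed reconstruction, not the authors' code) of the
# negative-word ranking: normalized positive frequencies are subtracted
# from normalized negative frequencies, and words still common in the
# positive comments are dropped before ranking.
from collections import Counter

def distinctive_negative_words(neg_comments, pos_comments,
                               pos_cutoff=0.01, top_k=10):
    neg_counts = Counter(w for c in neg_comments for w in c.lower().split())
    pos_counts = Counter(w for c in pos_comments for w in c.lower().split())
    neg_total = sum(neg_counts.values()) or 1
    pos_total = sum(pos_counts.values()) or 1
    scores = {}
    for word, n in neg_counts.items():
        pos_freq = pos_counts[word] / pos_total
        if pos_freq > pos_cutoff:       # too common in positive comments
            continue
        scores[word] = n / neg_total - pos_freq
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

neg = ["foreign workers not paying taxes", "oppose this program wages suffer"]
pos = ["students bring skills", "jobs are created"]
print(distinctive_negative_words(neg, pos))
```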
From Fig. 6 it can be noticed, for example, that words like
{workers, program, foreign, oppose, homeland, taxes, wages,
medicare, paying} now appear in the negative-only word
list. These words suggest that most opposing commenters
are mainly concerned that, under the new program, foreign
workers will take jobs from American workers in their own
homeland without paying taxes. Another interesting
observation was the token “765”, which refers to Form I-765,
filed when applying for OPT, and words like “medicare”,
indicating that opposing commenters are also concerned
about entitlement programs. For example, consider this
comment, in which words like {wage, medicare, taxes, pay,
social, security} appear:
“American IT jobs should be done by natural born
Americans, not foreigners, who will work for substandard
wages and be exempt from the taxes that are paid to help
support our economy and social security and Medicare."
Our code is currently hosted at [8].
6. ACKNOWLEDGEMENTS
This project was inspired by a similar analysis done in [2].
7. REFERENCES
[1] http://www.regulations.gov/#!docketDetail;D=ICEB-2015-0002
[2] https://medium.com/@heretic/on-opt-optional-practical-training-10ced7051066#.goc1w933d
[3] Nigam, Kamal; McCallum, Andrew; Thrun, Sebastian; Mitchell, Tom (2000). “Learning to classify text from labeled and unlabeled documents using EM”
[4] http://immigrationgirl.com/breaking-news-on-opt-stem-extension-court-says-uscis-rule-allowing-17-month-stem-extension-is-deficient/
[5] http://vikeshkhanna.webfactional.com/opt
[6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1359749&reason=concurrency
[7] http://dl.acm.org/citation.cfm?id=288651
[8] https://github.com/ananducsd/opt-data-mining
[9] Aurangzeb Khan, Baharum Baharudin, Lam Hong Lee, Khairullah Khan. “A Review of Machine Learning Algorithms for Text-Documents Classification”
[10] http://www.time.mk/trajkovski/thesis/text-class.pdf