BUSINESS EVENT RECOGNITION FROM
ONLINE NEWS ARTICLES
A Project Report
submitted by
MOHAN KASHYAP.P
in partial fulfillment of the requirements
for the award of the degree of
MASTER OF TECHNOLOGY
IN
MACHINE LEARNING AND COMPUTING
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF SPACE SCIENCE AND TECHNOLOGY
Thiruvananthapuram - 695547
May 2015
CERTIFICATE
This is to certify that the thesis titled 'Business Event Recognition From
Online News Articles', submitted by Mohan Kashyap.P, to the Indian Institute
of Space Science and Technology, Thiruvananthapuram, for the award of the
degree of MASTER OF TECHNOLOGY, is a bona fide record of the research
work done by him under my supervision. The contents of this thesis, in full or in
parts, have not been submitted to any other Institute or University for the award
of any degree or diploma.
Dr. Sumitra.S
Supervisor
Department of Mathematics
IIST
Dr. Raju K. George
Head of Department
Department of Mathematics
IIST
Place: Thiruvananthapuram
May, 2015
DECLARATION
I declare that this thesis titled 'Business Event Recognition From Online
News Articles', submitted in fulfillment of the degree of MASTER OF
TECHNOLOGY, is a record of original work carried out by me under the supervision
of Dr. Sumitra S., and has not formed the basis for the award of any degree,
diploma, associateship, fellowship or other title in this or any other Institution
or University of higher learning. In keeping with ethical practice in reporting
scientific information, due acknowledgements have been made wherever the
findings of others have been cited.
Mohan Kashyap.P
SC13M055
Place: Thiruvananthapuram
May, 2015
Abstract
Business event recognition from online news articles deals with extracting
news text related to business events in three domains: acquisition,
vendor-supplier and job. The developed automated model predicts whether an
online news article contains a business event or not. To build the model, data
related to business events was crawled from the web. Since manual labeling of
the data was expensive, semi-supervised learning techniques were used to obtain
the required labeled data, and the tagged data was then pre-processed using
natural language processing techniques. Vectorizers were then applied to convert
the text into numeric form using the bag-of-words, word-embedding and word2vec
approaches. Finally, ensemble classifiers with the bag-of-words approach, and
convolutional neural networks (CNN) with the word-embedding and word2vec
approaches, were applied to the business event datasets, and the results
obtained were found to be promising.
Acknowledgements
First and foremost I thank God, The Almighty, for all his blessings. I would
like to express my deepest gratitude to my research supervisor and teacher,
Dr. Sumitra S., for her continuous guidance and motivation, without which this
research work would never have been possible. I cannot thank her enough for her
limitless patience and dedication in correcting my thesis report and molding it
into its present form. Interactions with her taught me the importance of small
things that are often overlooked, and exposed me to the art of approaching a
problem from different angles. These lessons will be invaluable for me in my
career and personal life ahead.

Besides my supervisor, I would like to thank my mentor, Mr. Mahesh C.R. of
TataaTsu Idea Labs, for allowing me to carry out my thesis work in their
organization. I am deeply grateful to him for helping me realize my abilities
and for building my confidence to solve challenging problems in Machine
Learning, turning my theoretical understanding into practical real-time
implementation. My sincere thanks also go to all the faculty members of the
Mathematics Department for their encouragement, questions and insightful
comments.

I am grateful to my project lead at TataaTsu Idea Labs, Mr. Vinay, and his
team for helping me in the implementation of the project work.

I would like to thank Research Scholar Shiju S. Nair for extending his
'any time' help and for providing additional inputs to my work.

I would also like to thank my classmates and friends in IIST for their company
and for all the fun we had during the two years of M.Tech. Hailing from an
Electrical background and not being that great at coding, special thanks go to
Praveen and Sailesh for constantly supporting and guiding me through two years
of machine learning, and to Arvindh for inspiring me in certain aspects of the
course work.

Last but not the least, I would like to thank my parents and my sister for
their care, love and support throughout my life.
List of Figures
3.1 Architecture of the Convolutional Neural Network with sentence modelling for the multichannel architecture
4.1 Variations in accuracies and F1-scores for Vendor-supplier data using the Naive Bayes semi-supervised technique
4.2 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for VNSP
4.3 ROC curve for large pool of testing data of 70 percent and training data of 30 percent for VNSP
4.4 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for VNSP
4.5 ROC curve for large pool of testing data of 60 percent and training data of 40 percent for VNSP
4.6 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for VNSP
4.7 ROC curve for large pool of testing data of 50 percent and training data of 50 percent for VNSP
4.8 Variations in accuracies and F1-scores for Job event data using the Naive Bayes semi-supervised technique
4.9 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for JOB
4.10 ROC curve for large pool of testing data of 70 percent and training data of 30 percent for JOB
4.11 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for JOB
4.12 ROC curve for large pool of testing data of 60 percent and training data of 40 percent for JOB
4.13 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for JOB
4.14 ROC curve for large pool of testing data of 50 percent and training data of 50 percent for JOB
4.15 Variations in accuracies and F1-scores for Acquisition event data using the Naive Bayes semi-supervised technique
4.16 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition
4.17 ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition
4.18 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition
4.19 ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition
4.20 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Acquisition
4.21 ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Acquisition
4.22 Variations in accuracies and F1-scores for Vendor-supplier data using Active learning
4.23 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier
4.24 ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier
4.25 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier
4.26 ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier
4.27 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier
4.28 ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier
4.29 Variations in accuracies and F1-scores for Job event data using Active learning
4.30 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Job
4.31 ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Job
4.32 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Job
4.33 ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Job
4.34 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job
4.35 ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Job
4.36 Variations in accuracies and F1-scores for Acquisition event data using Active learning
4.37 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition
4.38 ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition
4.39 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition
4.40 ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition
4.41 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job
4.42 ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Job
4.43 Variations in accuracies and F1-scores for Vendor-supplier data for 5-fold using 3 ensemble classifiers
4.44 Confusion matrix for Vendor-supplier with number of estimators as 100
4.45 ROC curve for Vendor-supplier with number of estimators as 100
4.46 Variations in accuracies and F1-scores for Job data for 5-fold using 3 ensemble classifiers
4.47 Confusion matrix for Job with number of estimators as 100
4.48 ROC curve for Job with number of estimators as 100
4.49 Variations in accuracies and F1-scores for Acquisition data for 5-fold using 3 ensemble classifiers
4.50 Confusion matrix for Acquisition with number of estimators as 100
4.51 ROC curve for Acquisition with number of estimators as 100
4.52 Variations in accuracies and F1-scores for Vendor-supplier data for 5-fold using 3 ensemble classifiers
4.53 Confusion matrix for Vendor-supplier with number of estimators as 500
4.54 ROC curve for Vendor-supplier with number of estimators as 500
4.55 Variations in accuracies and F1-scores for Job data for 5-fold using 3 ensemble classifiers
4.56 Confusion matrix for Job with number of estimators as 500
4.57 ROC curve for Job with number of estimators as 500
4.58 Variations in accuracies and F1-scores for Acquisition data for 5-fold using 3 ensemble classifiers
4.59 Confusion matrix for Acquisition with number of estimators as 500
4.60 ROC curve for Acquisition with number of estimators as 500
4.61 Variations in accuracies and F1-scores for Vendor-supplier data for the whole data set
4.62 Variations in accuracies and F1-scores for Job data for the whole data set
4.63 Variations in accuracies and F1-scores for Acquisition data, 5-fold accuracy variations for the whole data set
4.64 CNN-rand and CNN-word2vec models for Vendor-supplier on the whole data set with 3 folds
4.65 CNN-rand and CNN-word2vec models for Acquisition on the whole data set with 3 folds
4.66 CNN-rand and CNN-word2vec models for Job on the whole data set with 3 folds
List of Tables
1.1 Recognition of Named-Event Passages in News Articles and its application to our work
2.1 The words and their counts in sentence 1
2.2 The words and their counts in sentence 2
4.1 Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Vendor-supplier data
4.2 Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Job event data
4.3 Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Acquisition event data
4.4 Variation in accuracies and F-scores using Active Learning for Vendor-supplier event data
4.5 Variation in accuracies and F-scores using Active Learning for Job event data
4.6 Variation in accuracies and F-scores using Active Learning for Acquisition event data
4.7 Variation in accuracies and F-scores for the Gradient Boosting classifier with number of estimators as 100 on the Vendor-supplier data set
4.8 Variation in accuracies and F-scores for the Ada Boosting classifier with number of estimators as 100 on the Vendor-supplier data set
4.9 Variation in accuracies and F-scores for the Random forest classifier with number of estimators as 100 on the Vendor-supplier data set
4.10 Variation in test score for accuracy and F-score with Vendor-supplier data using voting of three ensemble classifiers with number of estimators as 100
4.11 Variation in accuracies and F-scores for the Gradient Boosting classifier with number of estimators as 100 on the Job data set
4.12 Variation in accuracies and F-scores for the Ada Boosting classifier with number of estimators as 100 on the Job data set
4.13 Variation in accuracies and F-scores for the Random forest classifier with number of estimators as 100 on the Job data set
4.14 Variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 100
4.15 Variation in accuracies and F-scores for the Gradient Boosting classifier with number of estimators as 100 on the Acquisition data set
4.16 Variation in accuracies and F-scores for the Random forest classifier with number of estimators as 100 on the Acquisition data set
4.17 Variation in accuracies and F-scores for the Ada Boosting classifier with number of estimators as 100 on the Acquisition data set
4.18 Variation in test score for accuracy and F-score with Acquisition data using voting of three ensemble classifiers with number of estimators as 100
4.19 Variation in accuracies and F-scores for the Gradient Boosting classifier with number of estimators as 500 on the Vendor-supplier data set
4.20 Variation in accuracies and F-scores for the Ada Boosting classifier with number of estimators as 500 on the Vendor-supplier data set
4.21 Variation in accuracies and F-scores for the Random forest classifier with number of estimators as 500 on the Vendor-supplier data set
4.22 Variation in test score for accuracy and F-score with Vendor-supplier data using voting of three ensemble classifiers with number of estimators as 500
4.23 Variation in accuracies and F-scores for the Gradient Boosting classifier with number of estimators as 500 on the Job data set
4.24 Variation in accuracies and F-scores for the Ada Boosting classifier with number of estimators as 500 on the Job data set
4.25 Variation in accuracies and F-scores for the Random forest classifier with number of estimators as 500 on the Job data set
4.26 Variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 500
4.27 Variation in accuracies and F-scores for the Gradient Boosting classifier with number of estimators as 500 on the Acquisition data set
4.28 Variation in accuracies and F-scores for the Random forest classifier with number of estimators as 500 on the Acquisition data set
4.29 Variation in accuracies and F-scores for the Ada Boosting classifier with number of estimators as 500 on the Acquisition data set
4.30 Variation in test score for accuracy and F-score with Acquisition data using voting of three ensemble classifiers with number of estimators as 500
4.31 Variation in accuracies and F-scores for the Gradient Boosting classifier for the whole Vendor-supplier data set
4.32 Variation in accuracies and F-scores for the Ada Boosting classifier for the whole Vendor-supplier data set
4.33 Variation in accuracies and F-scores for the Random forest classifier for the whole Vendor-supplier data set
4.34 Variation in accuracies and F-scores for the Gradient Boosting classifier for the whole Job data set
4.35 Variation in accuracies and F-scores for the Ada Boosting classifier for the whole Job data set
4.36 Variation in accuracies and F-scores for the Random forest classifier for the whole Job data set
4.37 Variation in accuracies and F-scores for the Gradient Boosting classifier for the whole Acquisition data set
4.38 Variation in accuracies and F-scores for the Random forest classifier for the whole Acquisition data set
4.39 Variation in accuracies and F-scores for the Ada Boosting classifier for the whole Acquisition data set
4.40 Variation in test score for MFN with word embedding
4.41 Variation in accuracies and F-scores for CNN-rand and CNN-word2vec models for Vendor-supplier on the whole data set
4.42 Variation in accuracies and F-scores for CNN-rand and CNN-word2vec models for Acquisition on the whole data set
4.43 Variation in accuracies and F-scores for CNN-rand and CNN-word2vec models for Job on the whole data set
List of Abbreviations
POS Part of Speech
NLTK Natural Language Toolkit
QBC Query By Committee
NLP Natural Language Processing
IE Information Extraction
IR Information Retrieval
NER Named Entity Recognizer
ML Machine Learning
CNN Convolutional Neural Network
MFN Multilayer Feedforward Network
TF Term Frequency
IDF Inverse Document Frequency
CBOW Continuous Bag of Words
ROC Receiver Operating Characteristic
TPR True Positive Rate
FPR False Positive Rate
TP True Positives
FP False Positives
TN True Negatives
FN False Negatives
Chapter 1
Introduction
Textual information on the web is unstructured, and extracting useful
information from it for a specific purpose is tedious and challenging. Over the
years, various methods have been proposed for extracting useful text. Text
mining is the domain that deals with deriving high-quality information from
unstructured text. Its goal is essentially to convert unstructured text into
structured data, thereby extracting useful information by applying techniques
of natural language processing (NLP) and pattern recognition. The concept of
manual text mining was first introduced in the mid-1980s (Hobbs et al., 1982).
Over the past decade, technological advancements in this field have been
significant, with the building of automated approaches for the extraction and
analysis of text. Text mining draws on five major components: information
retrieval, data mining, machine learning, statistics and computational
linguistics.
Text mining has applications in various domains, including: (a) named entity
recognition, which deals with the identification of named text features such as
people, organizations and locations (Sang et al., 2003); (b) recognition of
pattern-identified entities, which deals with the extraction of features such
as telephone numbers, e-mail addresses and built-in database quantities that
can be discerned using regular expressions or other pattern matches (Nadeau et
al., 2007); (c) co-reference resolution, which deals with the identification of
noun phrases and of other terms that refer to these nouns, e.g. her, him, it
and their (Soon et al., 2001); (d) sentiment analysis, which includes
extracting various forms of user intent information such as sentiment, opinion,
mood and emotion, with text analytics techniques helpful in the analysis of
sentiment at the level of different topics (Pang et al., 2008); (e) spam
detection, which deals with the classification of e-mail as spam or not, based
on the application of statistical machine learning and text mining techniques
(Rowe et al., 2007); (f) news analytics, which deals with the extraction of
vital news or information content of interest to the end user; and (g) business
event recognition from online news articles.
Business event recognition from online news articles captures
semantic signals and identifies patterns in unstructured text to extract
business events in three main domains, i.e. acquisition, vendor-supplier and
job events, from online news articles. An acquisition business event news
pattern is, in general, of the context of an organization acquiring another
organization. The keywords used in the acquisition business event scenario are
acquire, buy, sell, sold, bought, take-over, purchase and merger. A
vendor-supplier business event news pattern is, in general, of the context of
an organization obtaining a contract from another organization to perform a
certain task for that organization. The keywords used in the vendor-supplier
business event scenario are contract, procure, sign, implement, select, award,
work, agreement, deploy, provide, team, collaborate, deliver and joint. A job
business event news pattern is, in general, of the context of appointments of
persons to prominent positions, and the hiring and firing of people within an
organization.
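As a rough illustration, keyword patterns like the ones above can be used to pre-filter candidate sentences before any classification. This is a minimal sketch, not the thesis implementation: the acquisition and vendor-supplier lists come from the text, while the job stems, the function name and the prefix-matching strategy are illustrative assumptions.

```python
import re

# Domain keyword lists: acquisition and vendor-supplier are from the text;
# the job stems are illustrative assumptions derived from "appointments,
# hiring and firing".
EVENT_KEYWORDS = {
    "acquisition": ["acquire", "buy", "sell", "sold", "bought",
                    "take-over", "purchase", "merger"],
    "vendor-supplier": ["contract", "procure", "sign", "implement", "select",
                        "award", "work", "agreement", "deploy", "provide",
                        "team", "collaborate", "deliver", "joint"],
    "job": ["appoint", "hire", "fire"],
}

def candidate_domains(sentence):
    """Return the event domains whose keywords appear in the sentence.

    Prefix matching ("acquire" matches "acquired") is a crude stand-in for
    stemming; this is only a recall-oriented pre-filter, since the actual
    decision is made by trained classifiers, not keyword matching alone.
    """
    words = re.findall(r"[a-z-]+", sentence.lower())
    hits = []
    for domain, keywords in EVENT_KEYWORDS.items():
        if any(w.startswith(k) for k in keywords for w in words):
            hits.append(domain)
    return sorted(hits)

print(candidate_domains("Google acquired the startup"))       # ['acquisition']
print(candidate_domains("The firm signed a supply agreement"))  # ['vendor-supplier']
```

Such a filter only proposes candidates; overlapping keywords (e.g. "work") make it far too noisy to serve as a classifier by itself.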
This thesis deals with the development of an automated model for business
event recognition from online news articles. For developing the model, data was
crawled from different websites such as Reuters news, businesswireindia.com and
prnewswire.com. Since manual labeling of the data was expensive, the gathered
data was subjected to semi-supervised learning techniques and active learning
methods to obtain more tagged event data in the domains of acquisition,
vendor-supplier and job. The tagged data thus obtained was pre-processed using
natural language processing techniques. Further, for the conversion of text to
numerics, the bag-of-words, word-embedding and word2vec approaches were used.
The final analysis on the business event datasets was performed using ensemble
classifiers with the bag-of-words approach and convolutional neural networks
with the word-embedding and word2vec approaches.
1.1 Model Architecture
Given a set of online articles or documents of interest to the end user, the
developed automated model must predict, as the class output, whether a given
sentence contains a business event related to acquisition, vendor-supplier or
job events.

If the automated model predicts a sentence as a business event, it then has to
give additional information describing the event, such as the entities involved
in that particular event, like organizations and people. Providing such
additional information helps the end user make better decisions with quicker
insights.

Business events happen around the world on a daily basis. An organization would
like to understand, as a competitor, the business activities of other
organizations. The development of an automated approach for identifying such
business events helps in better decision making, increases efficiency and helps
the organization develop better business strategies.
1.2 Methods
The sections below describe the methods used in our work.
1.2.1 Natural Language Processing
The concepts of information extraction and information retrieval in our work
deal with the extraction and retrieval of business news containing business
event sentences from online news articles. The concepts of part-of-speech (POS)
tagging and named entity recognition (NER) are used as part of feature
engineering in our work: POS tag patterns are essential in extracting useful
semantic features, and NER is useful in extracting entity-type features like
organizations, persons and locations, which form an integral part of any
business event. The framework of our project is formed by the concepts of
information extraction (IE) and information retrieval (IR). Discussed below are
information extraction and retrieval, named entity recognition (NER) and
part-of-speech (POS) tagging, which form the baseline for the implementation of
natural language processing techniques (Liddy, 2001).
Information Extraction and Retrieval: Information extraction and retrieval deal
with searching for the required text, extracting semantic information from the
text and storing the retrieved information in a particular form in a database.

Named Entity Recognition: Named entity recognition deals with extracting, from
a document of text, a set of typed entities such as people, places and
organizations.

Part-of-Speech Tagging: The pattern of POS tags forms an important set of
features for any NLP-related task; extraction of proper semantic features is
possible with the pattern of POS tags.
1.2.2 Text to Numeric Conversion
The conversion of words to vectors was implemented using the bag-of-words and
word-embedding approaches. An overview of these concepts is given below.

In the bag-of-words approach, a piece of text or a sentence of a document is
represented as the bag (multiset) of its words, disregarding grammar and word
order but keeping the multiplicity of the words intact (Harris, 1954). Word
embedding is the collective name for a set of language modeling and feature
learning techniques in natural language processing where words from sentences
are mapped to vectors of real numbers in a space of low dimension relative to
the vocabulary size (Tomas Mikolov et al., 2013).

One of the major disadvantages of the bag-of-words approach is that it fails to
capture the semantics of a particular word within a sentence, because it
converts words to vectors disregarding grammar and order. Consider the
following sentence, where the bag-of-words approach fails:

After drawing money from the bank, Ravi went to the river bank.

In the bag-of-words approach there is no distinction between the financial bank
and the river bank. This problem of capturing the semantics of a word is, to a
certain extent, overcome by word embedding. In word embedding, each word is
represented by a 100- to 300-dimensional dense vector whose entries are
initialized uniformly at random (i.e. U[-1, 1]). Word embedding with a window
approach captures semantics to a certain extent.
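A minimal bag-of-words sketch of the example above (pure Python; the tokenizer and vocabulary handling are illustrative assumptions, not the vectorizer used in the thesis):

```python
import re
from collections import Counter

def bow_vector(sentence, vocabulary):
    """Count-based bag-of-words vector: grammar and word order are discarded;
    only the multiplicity of each vocabulary word is kept."""
    counts = Counter(re.findall(r"[a-z]+", sentence.lower()))
    return [counts[w] for w in vocabulary]

sentence = "After drawing money from the bank Ravi went to the river bank"
vocab = sorted(set(re.findall(r"[a-z]+", sentence.lower())))
vec = bow_vector(sentence, vocab)

# Both occurrences of "bank" collapse into a single count of 2, so the
# financial bank and the river bank are indistinguishable in this encoding.
print(dict(zip(vocab, vec)))
```

The printed mapping has "bank" with count 2 and no positional information, which is exactly the semantic loss that word embedding with a window approach tries to mitigate.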
1.2.3 Data Labeling
The data points labeled in a supervised manner were few in number. The sections
below describe the semi-supervised technique and the active learning methods
used to obtain more labeled data.
1.2.3.1 Semi-supervised Technique
The naive Bayes classifier forms an integral part of the implementation of
semi-supervised learning, in which naive Bayes is combined with expectation
maximization to increase the number of labeled data points (Kamal Nigam et al.,
2006). An overview of the naive Bayes classifier is given below.

Naive Bayes classifiers are probabilistic classifiers that use Bayes' theorem.
The naive Bayes classifier assumes that each feature is conditionally
independent of every other feature given the class. The model is described as
follows:
Given an input feature vector x = (x1, x2, ..., xn)^T, we need to calculate
which class this feature vector belongs to, i.e. p(Yk | x1, x2, ..., xn) for
each of the k classes, where Yk is the output variable for the kth class. Using
Bayes' theorem, this probability can be rewritten as:

p(Yk | x) = p(Yk) p(x | Yk) / p(x)

where
p(Yk) is the prior probability of class k,
p(x | Yk) is the class-conditional likelihood, and
p(x) is the evidence, i.e. the probability of the data point x.

The naive Bayes classifier uses the maximum a posteriori (MAP) rule to pick the
most probable output class, since the posterior is proportional to the prior
times the likelihood. The classifier assigns a label ŷ = Yk by the MAP rule,
and its prediction is given by:

ŷ = argmax_{k ∈ {1, ..., K}} p(Yk) ∏_{i=1}^{n} p(xi | Yk)

In text mining, the classifier used is the multinomial naive Bayes classifier
with the bag-of-words approach.
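A toy multinomial naive Bayes sketch over bag-of-words counts (pure Python with add-one smoothing; the training documents and the smoothing scheme are illustrative assumptions, not the thesis data or code):

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Estimate log-priors p(Y_k) and smoothed log-likelihoods p(x_i | Y_k)
    from tokenized documents, using per-class bag-of-words counts."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    word_counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        word_counts[y].update(d)
    loglik = {}
    for c in classes:
        total = sum(word_counts[c].values())
        # Laplace (add-one) smoothing over the vocabulary.
        loglik[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                     for w in vocab}
    return prior, loglik

def predict(doc, prior, loglik):
    """MAP rule: argmax_k  log p(Y_k) + sum_i log p(x_i | Y_k)."""
    scores = {c: prior[c] + sum(loglik[c].get(w, 0.0) for w in doc)
              for c in prior}
    return max(scores, key=scores.get)

docs = [["acme", "acquires", "startup"],
        ["firm", "wins", "supply", "contract"],
        ["acme", "buys", "rival"],
        ["vendor", "signs", "contract"]]
labels = ["acquisition", "vendor", "acquisition", "vendor"]
prior, loglik = train_multinomial_nb(docs, labels)
print(predict(["startup", "acquires", "firm"], prior, loglik))  # acquisition
```

Working in log-space avoids numeric underflow from multiplying many small probabilities, which is why the product in the MAP formula becomes a sum here.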
1.2.3.2 Active Learning
Active learning with the query-by-committee approach using ensemble classifiers
was implemented as part of our work to increase the number of labeled data
points (Abe and Mamitsuka, 1998). The concept of active learning is discussed
below.
Active learning is a special case of semi-supervised machine learning in which
a learning algorithm is able to interactively query the user (or some other
information source) to obtain the desired outputs at new data points. There are
situations in which unlabeled data is abundant but manual labeling is
expensive; in such scenarios, learning algorithms can actively query the user
for labels. This type of iterative supervised learning is called active
learning. Since the learner chooses the examples, the number of examples needed
to learn a concept can often be much lower than the number required in normal
supervised learning. The following are query strategies for selecting the most
informative data points in active learning.
Uncertainty sampling: Uncertainty Sampling deals with labeling of those points
for which current model is least certain about or for which labeled data point en-
tropy value is maximum, by querying with the user.
Query by committee: A committee of classifiers is trained on the current labeled data points. The committee votes on the predicted labels of the unlabeled points, and the user is queried for the labels on which the classifiers disagree the most.
Expected model change: Labels the data points which would result in the greatest change to the current model.
Expected error reduction: Labels those points which would most reduce the current model's generalization error.
Variance reduction: Labels those points which minimize the output variance of the current model the most; in an SVM these are the points nearest to the separating hyperplane.
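The first strategy, uncertainty sampling, can be sketched as selecting the unlabeled point whose predicted class distribution has maximum entropy; the probabilities below are illustrative stand-ins for a real model's outputs:

```python
import math

# Uncertainty sampling sketch: query the point the model is least certain about.
predicted_probs = {
    "doc_a": [0.95, 0.05],   # model is confident
    "doc_b": [0.55, 0.45],   # model is uncertain -> most informative
    "doc_c": [0.80, 0.20],
}

def entropy(p):
    # entropy of a predicted class distribution
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

query = max(predicted_probs, key=lambda d: entropy(predicted_probs[d]))
print(query)  # doc_b has the highest entropy
```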
1.2.4 Learning Classifiers
The classifiers used in our work were ensemble classifiers and convolutional neural networks (CNNs). The sections below give an overview of the concepts required to understand the ensemble methods and the CNN implemented in our work.
1.2.4.1 Ensemble Classifiers
The random forest classifier implemented in our work (Breiman, 2001) is derived from the bootstrap aggregation technique. The gradient boosting classifier (Friedman et al., 2001) and the AdaBoost classifier (Freund et al., 1995) implemented in our work are derived from the boosting technique. The concepts of ensembles with bagging and boosting are discussed below.
An ensemble combines classifiers so that the performance of the combined classifier on the model exceeds the performance of each individual classifier. There are two different kinds of ensemble methods in practice: one is bagging, also called bootstrap aggregation, and the other is boosting.
Bagging: In bagging, a single classifier is learned from a bootstrap sample of the training data at each iteration. From a training set of M examples, M random instances are drawn with replacement using a uniform distribution. Since sampling is with replacement, some data points may be picked more than once and others not at all within a given subset of the original training dataset. A classifier is learned from such a subset of the training data in each cycle, and the process is repeated several times. The final prediction is made by taking a vote of the classifiers trained on the different generated datasets.
Boosting: In boosting, a single classifier (or different classifiers) is learned from a subset of the data at each iteration. The boosting technique analyses the performance of the classifier learned at each iteration and forces subsequent classifiers to focus on the training instances that were incorrectly classified. Instead of choosing the M training instances randomly using a uniform distribution, one chooses the training instances so as to favour the instances that have not been accurately learned by the classifier. The final prediction is made by taking a weighted vote of the classifiers learned at the various iterations.
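The bagging procedure above can be sketched on toy one-dimensional data; the trivial threshold base learner below is illustrative only (real implementations such as random forests use decision trees):

```python
import random
from collections import Counter

random.seed(0)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.7, 1), (0.8, 1), (0.9, 1)]

def train_stump(sample):
    # toy base learner: threshold at the midpoint between the class means
    m0 = [x for x, y in sample if y == 0]
    m1 = [x for x, y in sample if y == 1]
    t = (sum(m0) / max(len(m0), 1) + sum(m1) / max(len(m1), 1)) / 2
    return lambda x: int(x > t)

def bagging_predict(x, rounds=25):
    votes = []
    for _ in range(rounds):
        # bootstrap: draw len(data) instances with replacement
        sample = [random.choice(data) for _ in data]
        votes.append(train_stump(sample)(x))
    return Counter(votes).most_common(1)[0][0]  # majority vote

pred_hi = bagging_predict(0.85)
pred_lo = bagging_predict(0.15)
print(pred_hi, pred_lo)
```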
1.2.5 Convolutional Neural Network
A convolutional neural network for sentence modelling, trained with a softmax classifier, was implemented in our work (Yoon Kim, 2014). An overview of a generalized convolutional neural network and the softmax classifier is given below.
A convolutional neural network is a type of feed-forward neural network whose architecture consists of four main layers: the convolutional layer, the pooling layer, the fully connected layer and the loss layer. Stacking these layers forms the full conv-net architecture.
Convolutional layer: In conventional image processing, convolution with operators such as the Sobel and Prewitt filters is useful for detecting image features such as edges and corners. In a convolutional neural net, by contrast, the parameters of each convolutional kernel (i.e. each filter) are trained by the back-propagation algorithm. There are many convolution kernels in each layer, and each kernel is replicated over the entire image with the same parameters. The function of the convolution operators is to extract different features of the input.
Activation function: The activation functions used in convolutional neural networks are the hyperbolic tangent function f(x) = tanh(x), the ReLU function f(x) = max(0, x) and the sigmoid function f(x) = 1 / (1 + exp(−x)).
Pooling layer: This layer captures the most important feature by performing the max operation on the obtained feature map vector. All such max features together form the penultimate layer.
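The convolution and max-pooling steps above can be sketched in one dimension, as used in sentence modelling; the sequence values and kernel weights are illustrative stand-ins for word features and a trained filter:

```python
# 1-D convolution + max-pooling sketch: a filter slides over a sequence of
# word scores and max-pooling keeps the strongest response.
sequence = [0.2, 0.9, 0.1, 0.8, 0.7, 0.3]   # one feature value per word
kernel = [0.5, 1.0, 0.5]                     # one (trainable) 1-D filter

def conv1d(seq, k):
    n = len(k)
    # slide the kernel over every valid window of the sequence
    return [sum(seq[i + j] * k[j] for j in range(n))
            for i in range(len(seq) - n + 1)]

feature_map = conv1d(sequence, kernel)
pooled = max(feature_map)    # max-pooling: keep the most important feature
print(feature_map, pooled)
```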
Fully connected layer: Finally, after several convolutional and max-pooling layers, the high-level reasoning in the neural network is done via fully connected layers. A fully connected layer takes all neurons in the previous layer (be it fully connected, pooling or convolutional) and connects each of them to every neuron it has. Fully connected layers are no longer spatially located (they can be visualized as one-dimensional), so there can be no convolutional layers after a fully connected layer.
Loss layer: After the fully connected layer, a softmax classifier with a softmax loss function is present at the output layer to predict the probabilistic labels.
The softmax classifier is obtained from the softmax function: for a sample input vector x, the predicted probability of output y being the jth class among K classes is given as:
P(y = j|x) = exp(x^T wj) / Σ_{k=1}^{K} exp(x^T wk)
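The softmax computation above can be sketched as follows; the input vector and per-class weight vectors are illustrative, and the usual max-subtraction trick is added for numerical stability:

```python
import math

def softmax(scores):
    # turn class scores s_j = x^T w_j into probabilities summing to 1
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def dot(x, w):
    return sum(a * b for a, b in zip(x, w))

x = [1.0, 2.0]
weights = [[0.5, 0.1], [0.2, 0.4], [-0.3, 0.2]]  # one weight vector per class
probs = softmax([dot(x, w) for w in weights])
print(probs)
```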
1.2.6 Measures used for Analysing the Results:
The performance measures used for our results and analysis are described as follows (Powers et al., 2007).
1. F-score: The F-score is a measure used in information retrieval for measuring sentence classification performance, since it takes only the true positives into account and not the true negatives. The F-score is defined as:
F1 = 2·TP / (2·TP + FP + FN)
2. Confusion matrix: The performance of any classification algorithm can be visualized in a specific table layout called the confusion matrix. Each column of the confusion matrix represents the instances of a predicted class, while each row represents the instances of an actual class.
3. ROC curve: A plot of the TPR against the FPR. The TPR describes the proportion of true positive results among the total positive samples; the FPR describes the proportion of incorrect positive results among the total negative samples. The area under the ROC curve is a measure of accuracy.
4. Accuracy: The accuracy of a classification problem is defined as:
accuracy = (TP + TN) / (P + N)
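The four measures can be computed directly from the confusion-matrix counts; the counts below are illustrative values, not results from our datasets:

```python
# Illustrative confusion-matrix counts
TP, FP, FN, TN = 40, 10, 5, 45

P = TP + FN            # total positive samples
N = FP + TN            # total negative samples

f1 = 2 * TP / (2 * TP + FP + FN)
tpr = TP / P           # true positive rate
fpr = FP / N           # false positive rate
accuracy = (TP + TN) / (P + N)

print(f1, tpr, fpr, accuracy)
```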
1.3 Related Works
The paper closest to our work on business event recognition from online news articles is Recognition of Named-Event Passages in News Articles (Marujo et al., 2012). That paper describes a method for finding named events in the violent behaviour and business domains, locating the specific passages of news articles that contain information about such events, and reports preliminary evaluation results using NLP techniques and ML algorithms. Table (1.1) summarizes the paper Recognition of Named-Event Passages in News Articles and its application to our work.
As part of the feature engineering used in our work, we adopted some of the feature engineering techniques of (Marujo et al., 2012). The following features used in our work are taken with reference to this paper.
Part-of-speech (POS) pattern of the phrase: (e.g., <noun>, <adj, noun>, <adj, adj, noun>, etc.) Nouns and noun phrases are the most common pattern observed in key phrases containing named events, verbs and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare.
Extraction of rhetorical signal features: These are a set of features which capture the reader's attention in news events: continuation, change of direction, sequence, illustration, emphasis, cause, condition, result, spatial signals, comparison/contrast, conclusion and fuzz.
1.4 Thesis Outline
The second chapter deals with the extraction and understanding of business event data, the third chapter with the application of machine-learning algorithms to the obtained data, the fourth chapter with results and analysis on the business event datasets, and the fifth chapter with the conclusions of our work.
1.4.1 Second Chapter
This chapter deals with the extraction of business event data from the web, followed by pre-processing of the data, application of feature engineering to the obtained data and, finally, conversion of the data into vectors for applying machine-learning algorithms.
1.4.2 Third Chapter
This chapter deals with applying semi-supervised techniques to the data to increase the number of labeled data points, and with understanding the algorithms of the different ensemble classifiers and the convolutional neural network (CNN).
Table 1.1: Recognition of Named-Event Passages in News Articles and its application to our work

Recognition of named-event passages in news articles:
1. Deals with automatically identifying multi-sentence passages in a news article that describe named events. Specifically, the paper focuses on ten event types, five in the violent behavior domain (terrorism, suicide bombing, sex abuse, armed clashes, and street protests) and five in the business domain (management changes, mergers and acquisitions, strikes, legal troubles, and bankruptcy).
2. The problem is solved as a multiclass classification problem, for which the training data was obtained through crowd-sourcing on Amazon Mechanical Turk to label the data points as events or non-events. Ensemble classifiers are then used to classify the sentences for each event, and finally passages containing the same events are aggregated using HMM methods.

Business event recognition from online news articles:
1. Our work, derived from Recognition of Named-Event Passages in News Articles, focuses exclusively on identifying business events in the domains of merger and acquisition, vendor-supplier and job events.
2. The problem in our case is solved as binary classification for each of the three domains (merger and acquisition, vendor-supplier and job), deciding whether a sentence describes that particular event or not. The procedure also differs: we label a few data points in a supervised way, then increase the number of labeled data points by applying semi-supervised techniques, and finally apply ensemble classifiers and convolutional neural networks for classification of the labeled data points.
1.4.3 Fourth Chapter
This chapter deals with the results and analysis of the applied machine-learning techniques, which include semi-supervised learning analysis, ensemble classifier analysis and analysis of convolutional neural networks.
1.4.4 Fifth Chapter
This chapter deals with the challenges encountered while performing the project, the conclusions of the project and its future scope.
1.5 Thesis Contribution
Our work focuses on business event recognition in three domains: acquisition, vendor-supplier and job. The whole process of identifying business event news exclusively in these three domains using machine learning and NLP techniques is the main contribution of our work.
Chapter 2
Data Extraction, Data Pre-processing and Feature Engineering
The initial step in business event recognition is extracting business news and labeling some of the extracted data, so that the task can be formulated as a machine learning problem. The method of extracting data from the web and labeling some of the extracted data is described in the following section.
2.1 Crawling of Data from Web
There are several methods to crawl data from the web; one such method is described in this section. Every website has its own HTML logic, so separate crawling logic had to be written to extract text data from different websites. The Python modules used for data extraction are Beautiful Soup and urllib. For our study, information was extracted from particular websites such as businesswireindia.com, prnewswire.com and Reuters news.
Python was the language framework used in our work. The urllib module in Python is used to fetch the particular set of pages to be accessed on the web. The Beautiful Soup module in Python uses the HTML logic and finds the contents present within each page, block by block, in the form of the title, the subtitle and the corresponding description. Finally, the extracted title, subtitle and body contents are stored in text-file format.
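A minimal sketch of this extraction step, using only the standard library html.parser on an inline HTML snippet (the snippet and the tag-to-field mapping are illustrative; the actual pipeline fetched live pages with urllib and parsed them with Beautiful Soup):

```python
from html.parser import HTMLParser

# Illustrative page: title, subtitle and body in the layout our parser expects.
html_page = """
<html><head><title>Acme Corp announces acquisition</title></head>
<body><h2>Deal subtitle</h2><p>Acme Corp today announced a definitive
agreement to acquire a software company.</p></body></html>
"""

class ArticleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current = None
        self.fields = {"title": "", "subtitle": "", "body": ""}

    def handle_starttag(self, tag, attrs):
        # map the HTML tags of interest to output fields
        self.current = {"title": "title", "h2": "subtitle", "p": "body"}.get(tag)

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] += data.strip().replace("\n", " ")

parser = ArticleParser()
parser.feed(html_page)
print(parser.fields["title"])
```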
2.2 Labeling of Extracted Data
Since the business events are in the form of sentences, the raw text documents obtained from web crawling are split into sentences using the Natural Language Toolkit (NLTK) sentence tokenizer. Some of the sentences were labeled for each of the three classes (merger and acquisition, vendor-supplier and job) as describing a business event or not.
2.2.1 Data Description
Stated below is an illustration of data describing a business event or not a business event in the three classes of acquisition, vendor-supplier and job.
2.2.1.1 Acquisition Data Description
Acquisition event: ARMONK, N.Y., April 10, 2014 /PRNewswire/ – IBM
(NYSE: IBM) today announced a definitive agreement to acquire Silverpop, a
privately held software company based in Atlanta, GA.
Non Acquisition event: Carlyle invests across four segments Corporate Private Equity Real Assets Global Market Strategies and Solutions in Africa Asia Australia Europe the Middle East North America and South America.
2.2.1.2 Vendor-Supplier Data Description
Vendor-Supplier event: Tri-State signs agreement with NextEra Energy Resources for new wind facility in eastern Colorado under the Director Jack stone; WESTMINSTER, Colo., Feb. 5, 2014 /PRNewswire/ – Tri-State Generation and Transmission Association, Inc. announced that it has entered into a 25-year agreement with a subsidiary of NextEra Energy Resources, LLC for a 150 megawatt wind power generating facility to be constructed in eastern Colorado, in the service territory of Tri-State member cooperative K. C. Electric Association (Hugo, Colo.).
Non Vendor-Supplier event: The implementation of the DebMed GMS elec-
tronic hand hygiene monitoring system is a clear demonstration of Meadows Re-
gional Medical Center’s commitment to patient safety, and we are excited to
partner with such a forward-thinking organization that is focused on providing
a state-of-the-art patient environment, said Heather McLarney, vice president of
marketing, DebMed.
2.2.1.3 Job Data Description
Job event: In a note to investors, analysts at FBR Capital Markets said the
appointment of Nadella as Director of the company was a ”safe pick” compared
to choosing an outsider.
Non Job event: This partnership is an example of steps we are taking to sim-
plify and improve the Tactile Medical order process, said Cathy Gendreau,Business
Director.
2.2.2 Data Pre-processing
The business event sentences extracted as raw text during data extraction were cleansed by removing special characters and stop-words, which include words like the, and, an, etc. Stop-words are common to both the positive and the negative class, so to enhance the difference between the two classes we removed them. The NLTK module in Python was used for this pre-processing of the data.
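This pre-processing step can be sketched with the standard library alone; the stop-word list below is a small illustrative subset of NLTK's full English stop-word list:

```python
import re

# Small illustrative subset of a stop-word list
stopwords = {"the", "and", "an", "a", "to", "of", "is"}

def preprocess(sentence):
    # lowercase, drop special characters, then remove stop-words
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [t for t in tokens if t not in stopwords]

print(preprocess("IBM today announced an agreement to acquire Silverpop!"))
```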
2.3 Feature Engineering
To build hand-crafted features, we had to observe the extracted unstructured data and recognize patterns, so that useful features could be extracted. The features extracted are described below; the examples for the corresponding features are taken with reference to the vendor-supplier event in (2.2.1.2).
2.3.1 Type 1 Features
Shallow semantic features record the pattern and semantics of the data and consist of the following features (Marujo et al., 2012).
Nouns, noun phrases and proper nouns: Entities form an integral part of business event sentences, so noun phrases and proper nouns are common in sentences containing business events. Noun phrases, and correspondingly nouns and proper nouns, were extracted from each sentence using the NLTK part-of-speech tagger.
Example of Noun-phrase: Title agreement Next Era Energy wind facility
eastern Colorado WESTMINSTER Colo. Feb. Generation Transmission Associa-
tion Inc. agreement subsidiary NextEra Energy LLC megawatt wind power facility
eastern Colorado service territory member K. C. Electric Association Hugo Colo.
Word-Capital: If a capital letter is present in a sentence containing a business event, there is a higher chance of organizations, locations and persons being present in the sentence, which in turn are entity-type features that enhance event recognition.
Example of capital words: WESTMINSTER, LLC, K.C. Here WESTMINSTER is a location and K.C. is an organization, an illustration of entity features obtained from the Word-Capital feature.
Part-of-speech tag pattern: Patterns of part-of-speech tags such as adjective-noun (a noun preceded by an adjective) and adjective-adjective-noun (a noun preceded by two adjectives) are good features for event recognition. Adjectives are used to describe a noun, so there is a higher chance of finding such patterns in business event sentences. Nouns and noun phrases are the most common pattern observed in key phrases of business event sentences, verbs and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare.
Example of the POS tag pattern in adjective-noun format: new wind, 25-year agreement; here, for instance, the adjective new is followed by the noun wind.
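Extracting adjective-noun patterns from a POS-tagged sentence can be sketched as follows; the (word, tag) pairs are illustrative stand-ins for NLTK tagger output:

```python
# Illustrative (word, tag) pairs, as an NLTK POS tagger would produce
tagged = [("new", "JJ"), ("wind", "NN"), ("facility", "NN"),
          ("signs", "VBZ"), ("long", "JJ"), ("term", "NN"), ("deal", "NN")]

def adj_noun_pairs(tags):
    """Return (adjective, noun) pairs where a JJ tag directly precedes a noun tag."""
    return [(w1, w2) for (w1, t1), (w2, t2) in zip(tags, tags[1:])
            if t1 == "JJ" and t2.startswith("NN")]

print(adj_noun_pairs(tagged))
```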
2.3.2 Type 2 Features
Entity-type features capture the entities present in the business event sentence. Some of these features are described below.
Organization name: Organization names are usually present in sentences containing business events, and often give additional insight as features in event recognition.
Example of organization names: Tri-State Generation and Transmission Association, NextEra Energy Resources.
Organization references: References to organization entities present in the business event sentences are taken as features.
Example of organization references: K. C. Electric Association
Location: Location is an important entity-describing feature, giving more insight into the description of business events.
Example of location as a feature: WESTMINSTER, Colo., Colorado
Persons: There is a higher chance of a person or a group of people being present in sentences that contain business events, so persons are used as features to enhance business event recognition.
Example of persons: Jack stone
2.3.3 Type 3 Features
Rhetorical features: These are semantic signals which capture the reader's attention in business event sentences; the following features are identified in the literature (Marujo et al., 2012).
Continuation: There are more ideas to come, e.g.: moreover, furthermore, in addition, another.
Change of direction: There is a change of topic, e.g.: in spite of, nevertheless, the opposite, on the contrary.
Sequence: There is an order in presenting the ideas, e.g.: in the first place, next, into.
Illustration: Gives an example, e.g.: to illustrate, in the same way as, for instance, for example.
Emphasis: Increases the relevance of an idea; these are the most important signals, e.g.: it all boils down to, the most substantial issue, should be noted, the crux of the matter, more than anything else.
Cause, condition or result: There is a condition or modification attached to the following idea, e.g.: if, because, resulting from.
Spatial signals: Denote locations, e.g.: in front of, between, adjacent, west, east, north, south, beyond.
Comparison or contrast: Compares two ideas, e.g.: analogous to, better, less than, less, like, either.
Conclusion: Ends the introduction of the idea and may have special importance, e.g.: in summary, from this we see, last of all, hence, finally.
Fuzz: There is an idea that is not clear, e.g.: looks like, seems like, alleged, maybe, probably, sort of.
2.4 Description of Vectorizers
All the features extracted from a given sentence have to be converted into vectors using vectorizers such as the count vectorizer and the tf-idf vectorizer. The method used to convert words to vectors is the bag-of-words approach. The two vectorizers are described below.
2.4.1 Count Vectorizers
This module uses the counts of the words present within a sentence and converts them into vectors by building a dictionary for the word-to-vector conversion (Harris, 1954). An illustrative example of the count vectorizer is described below.
2.4.1.1 Example of Count Vectorizer
Consider the following two sentences.
a) John likes to watch movies. Mary likes movies too.
b) John also likes to watch football games.
Based on the above two sentences, the dictionary is constructed as follows:
{ John:1 , likes:2 , to:3 , watch:4 , movies:5 , also:6 , football:7 , games:8 , Mary:9
, too:10 }
The dictionary constructed has 10 distinct words. Using the indexes of the dictio-
nary, each sentence is represented by a 10-entry vector:
sentence1 : [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
sentence2 : [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
where each entry of the vectors refers to count of the corresponding entry in the
dictionary (this is also the histogram representation). For example, in the first
vector (which represents sentence 1), the first two entries are [1,2]. The first entry
corresponds to the word John which is the first word in the dictionary, and its
value is 1 because John appears in the first sentence 1 time. Similarly the second
entry corresponds to the word likes which is the second word in the dictionary,
and its value is 2 because likes appears in the first sentence 2 times. This vector
representation does not preserve the order of the words in the original sentences.
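The example above can be reproduced in a few lines, hard-coding the same dictionary order as in the text:

```python
# Dictionary with the same index assignment as in the example above
vocab = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]

def vectorize(sentence):
    # count vectorizer: one count per dictionary entry
    words = sentence.split()
    return [words.count(w) for w in vocab]

s1 = "John likes to watch movies Mary likes movies too"
s2 = "John also likes to watch football games"
print(vectorize(s1))  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(vectorize(s2))  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```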
2.4.2 Term Frequency and Inverse Document Frequency
Term frequency and inverse document frequency describe the importance of a particular word in a document or sentence within a collection of documents (Manning et al., 2008).
Term frequency (TF) is defined as the number of occurrences of a particular word within a document.
Inverse document frequency (IDF) is a measure that decreases with the number of documents containing the particular word.
For the analysis in our work using tf-idf with the bag-of-words approach, we treat each document as a sentence.
Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a particular word is to a sentence within a collection of sentences.
2.4.2.1 Formulation of Term Frequency and Inverse Document Frequency
Term frequency formulation: The term frequency tf(t, d) describes the number of times the term t occurs in the sentence d. Two formulations of term frequency are described below:
a) Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise.
b) Logarithmically scaled frequency: tf(t, d) = 1 + log f(t, d) if t occurs in d and 0 otherwise, where f(t, d) is the raw count of t in d.
Inverse document frequency formulation: Inverse document frequency is a measure of how much information a particular word provides in a sentence, in comparison with the collection of sentences under consideration. It measures whether the term is common or rare across the whole collection of sentences. Mathematically it is described as follows:
idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )
where
N: total number of sentences in the collection of sentences.
|{d ∈ D : t ∈ d}|: number of sentences d in which the term t appears (i.e., tf(t, d) ≠ 0). If the term is not in the collection of sentences, this will lead to a division by zero; it is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.
2.4.2.2 Description of the Combination of TF and IDF
The tf-idf is then calculated as tf-idf(t, d, D) = tf(t, d) × idf(t, D).
A high tf-idf weight is reached by a high term frequency (in the given sentence) and a low document frequency of the term in the whole collection of sentences; the weights hence tend to filter out common terms.
2.4.2.3 Example of TF-IDF Vectorizer
Consider the term frequency tables (2.1) and (2.2) for a collection consisting of only two sentences, as listed below.

Table 2.1: The words and their counts in sentence1
    Term      Term Count
    this      1
    is        1
    a         2
    sample    1

Table 2.2: The words and their counts in sentence2
    Term      Term Count
    this      1
    is        1
    another   2
    example   3
The calculation of tf-idf for the term this in sentence1 is performed as follows. Term frequency, in its basic form, is just the frequency looked up in the appropriate table; in this case it is one for the term this in sentence1. The IDF for the term this is given by:
idf(this, D) = log( N / |{d ∈ D : t ∈ d}| )
The numerator of the fraction, N, is the number of sentences, which is two. The number of sentences in which this appears is also two, giving the IDF as:
idf(this, D) = log(2/2) = 0
So the tf-idf value is zero for this term, and with the basic definition this is true of any term that occurs in all sentences.
Now consider the term example from sentence2, which occurs three times but in only one sentence, namely sentence2. For this term the tf-idf is:
tf(example, sentence2) = 3
idf(example, D) = log(2/1) ≈ 0.3010
tf-idf(example, sentence2) = tf(example, sentence2) × idf(example, D) = 3 × 0.3010 ≈ 0.9030
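The worked example above can be reproduced directly in code (using log base 10, as in the text):

```python
import math

# Term counts of the two sentences from the example above
sentences = [
    {"this": 1, "is": 1, "a": 2, "sample": 1},
    {"this": 1, "is": 1, "another": 2, "example": 3},
]

def idf(term):
    # idf(t, D) = log10( N / |{d in D : t in d}| )
    n_containing = sum(1 for s in sentences if term in s)
    return math.log10(len(sentences) / n_containing)

def tfidf(term, sentence):
    return sentence.get(term, 0) * idf(term)

print(tfidf("this", sentences[0]))      # 0: "this" occurs in all sentences
print(tfidf("example", sentences[1]))   # about 0.903
```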
Chapter 3
Machine Learning Algorithms Used For Analysis Of Business Event Recognition
This chapter discusses the set of machine learning algorithms implemented as part of our work. The semi-supervised approach using naive Bayes with expectation-maximization and active learning with QBC are used to increase the amount of labeled data. The gradient boosting classifier, AdaBoost classifier, random forest classifier, multilayered feed-forward network and convolutional neural network are used to classify the business event data. The following sections give a detailed understanding of these algorithms.
3.1 Semi-supervised Learning using Naive Bayes Classifier with Expectation-Maximization Algorithm
In this approach, a naive Bayes classifier is first built in the standard supervised fashion from the limited amount of labeled training data, and the unlabeled data is classified with this naive Bayes model, noting the probabilities associated with each class. A new naive Bayes classifier is then rebuilt using all the data, labeled and unlabeled, with the estimated class probabilities treated as true class labels. This process of classifying the unlabeled data and rebuilding the naive Bayes model is iterated until it converges to a stable classifier, and the corresponding set of labels for the unlabeled data is obtained. The algorithm is summarized below as in (Nigam et al., 2006).
1. Inputs: collections Xl of labeled sentences and Xu of unlabeled sentences.
2. Build an initial naive Bayes classifier K* from the labeled sentences Xl only.
3. Loop while the classifier parameters improve, as measured by the change in l(K|X, Y) (the log probability of the labeled and unlabeled data and the prior):
(a) (E-step) Use the current classifier K* to estimate the component membership of each unlabeled sentence, i.e. the probability that each mixture component (and class) generated each sentence, P(Y = cj|X = xi; K*), where X and Y are random variables, cj is the output of the jth class and xi is the ith input data point.
(b) (M-step) Re-estimate the classifier K*, given the estimated component membership of each sentence, using maximum a posteriori parameter estimation to find K* = argmax_K P(X, Y|K) P(K).
4. Output: the classifier K*, which takes an unlabeled sentence and predicts a class label.
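The E- and M-steps above can be sketched on a toy corpus, where each sentence is a bag of words and unlabeled sentences receive soft class memberships that are fed back as fractional counts; the corpus, smoothing and iteration count below are illustrative:

```python
import math
from collections import Counter

# Toy illustrative corpus
labeled = [("acquire deal agreement", 0), ("sunny weather forecast", 1)]
unlabeled = ["acquire agreement signed", "weather sunny today"]
K = 2

def train(docs_with_weights):
    """docs_with_weights: list of (word-list, [weight_class0, weight_class1])."""
    prior = [0.0] * K
    counts = [Counter() for _ in range(K)]
    vocab = set()
    for words, wts in docs_with_weights:
        vocab.update(words)
        for k in range(K):
            prior[k] += wts[k]
            for w in words:
                counts[k][w] += wts[k]   # fractional counts from soft labels
    V, total = len(vocab), sum(prior)
    def log_joint(words, k):
        s = math.log(prior[k] / total)
        tot_k = sum(counts[k].values())
        for w in words:
            s += math.log((counts[k][w] + 1) / (tot_k + V))  # Laplace smoothing
        return s
    return log_joint

def posteriors(log_joint, words):
    logs = [log_joint(words, k) for k in range(K)]
    m = max(logs)
    exps = [math.exp(l - m) for l in logs]
    return [e / sum(exps) for e in exps]

# initial classifier from the labeled sentences only
docs = [(t.split(), [1.0 if y == k else 0.0 for k in range(K)]) for t, y in labeled]
model = train(docs)
for _ in range(5):                                                        # EM loop
    soft = [(u.split(), posteriors(model, u.split())) for u in unlabeled]  # E-step
    model = train(docs + soft)                                             # M-step

print(posteriors(model, "acquire deal".split()))
```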
3.2 Active Learning using Ensemble Classifiers with the QBC Approach
The ensemble classifiers used for the QBC approach are the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. This approach is described briefly below.
3.2.1 Query by Committee
In this approach an ensemble of hypotheses is learned, and the examples that cause maximum disagreement among this committee (with respect to the predicted categorization) are selected as the most informative examples from a pool of unlabeled examples. QBC iteratively selects examples to be labeled for training; in each iteration a committee of classifiers built on the current training set predicts labels. It then evaluates the potential utility of each example in the unlabeled set, and selects a subset of examples with the highest expected utility. The labels for these examples are acquired and they are transferred to the training set. Typically, the utility of an example is determined by some measure of disagreement in the committee about its predicted label. This process is repeated until the number of available requests for labels is exhausted.
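The disagreement measure can be sketched with vote entropy; the committee votes below are illustrative stand-ins for the predictions of trained ensemble members:

```python
import math

# Illustrative committee predictions for three unlabeled documents
committee_votes = {
    "doc_a": [1, 1, 1],   # unanimous -> no disagreement
    "doc_b": [1, 0, 1],   # split vote -> maximum disagreement here
    "doc_c": [0, 0, 0],
}

def vote_entropy(votes):
    # entropy of the committee's vote distribution over labels
    n = len(votes)
    ent = 0.0
    for label in set(votes):
        p = votes.count(label) / n
        ent -= p * math.log(p)
    return ent

query = max(committee_votes, key=lambda d: vote_entropy(committee_votes[d]))
print(query)  # the document whose label should be requested from the user
```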
3.3 Ensemble Models for Classification of Business Events using the Bag-Of-Words Approach
The series of classifiers trained on the dataset included the SVM, decision tree, random forest, AdaBoost, gradient boosting and SGD classifiers. Among these, the boosting classifiers and the random forest classifier performed better than the others. We used three ensemble classifiers with a decision tree as the base learner, namely the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. In the end, classification of the business event datasets was done by majority voting of these classifiers. The description and mathematical formulation of each ensemble classifier is given below.
3.3.1 Gradient Boosting Classifier
Boosting algorithms are machine learning algorithms which build a strong classifier from a set of weak classifiers, typically decision trees. Gradient boosting is one such algorithm: it builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function. The differentiable loss function in our case is the binomial deviance loss. The algorithm, as described in (Friedman et al., 2001), is implemented as follows.
Input: training set (Xi, yi), i = 1, ..., n, with Xi ∈ H ⊆ R^n and yi ∈ {−1, 1}; a differentiable loss function L(y, F(X)), in our case the binomial deviance loss defined as log(1 + exp(−2yF(X))); and M, the number of iterations.
1. Initialize the model with a constant value:
F0(X) = argmin_γ Σ_{i=1}^{n} L(yi, γ).
2. For m = 1 to M:
(a) Compute the pseudo-responses:
rim = −[∂L(yi, F(Xi)) / ∂F(Xi)]_{F(X) = Fm−1(X)}, for i = 1, ..., n.
(b) Fit a base learner hm(X) to the pseudo-responses, using the training set {(Xi, rim)}_{i=1}^{n}.
(c) Compute the multiplier γm by solving the optimization problem:
γm = argmin_γ Σ_{i=1}^{n} L(yi, Fm−1(Xi) + γ hm(Xi)).
(d) Update the model: Fm(X) = Fm−1(X) + γm hm(X).
3. Output: FM(X) = Σ_{m=1}^{M} γm hm(X).
The value of the weight γm is found by an approximate Newton-Raphson solution, given as γm = Σ_{Xi ∈ hm} rim / Σ_{Xi ∈ hm} |rim| (2 − |rim|).
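A toy sketch of the stage-wise procedure above on one-dimensional data, assuming threshold stumps as base learners; for simplicity F0 is set to zero instead of the argmin constant, and a fixed shrinkage factor replaces the line search for γm:

```python
import math

# Illustrative 1-D data with labels in {-1, 1}
data = [(0.1, -1), (0.2, -1), (0.3, -1), (0.7, 1), (0.8, 1), (0.9, 1)]

def fit_stump(residuals):
    # least-squares fit of a threshold stump to the pseudo-responses
    best = None
    for t in [0.15, 0.25, 0.5, 0.75, 0.85]:
        left = [r for (x, _), r in zip(data, residuals) if x <= t]
        right = [r for (x, _), r in zip(data, residuals) if x > t]
        lv = sum(left) / len(left) if left else 0.0
        rv = sum(right) / len(right) if right else 0.0
        err = sum((r - (lv if x <= t else rv)) ** 2
                  for (x, _), r in zip(data, residuals))
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    _, t, lv, rv = best
    return lambda x, t=t, lv=lv, rv=rv: lv if x <= t else rv

F = [0.0] * len(data)          # F0(X) = 0 for simplicity
stumps, lr = [], 0.5           # shrinkage in place of the gamma line search
for m in range(20):
    # pseudo-responses for binomial deviance: r_i = 2 y_i / (1 + exp(2 y_i F_i))
    residuals = [2 * y / (1 + math.exp(2 * y * Fi))
                 for (x, y), Fi in zip(data, F)]
    h = fit_stump(residuals)
    stumps.append(h)
    F = [Fi + lr * h(x) for (x, y), Fi in zip(data, F)]

predict = lambda x: 1 if sum(lr * h(x) for h in stumps) > 0 else -1
print(predict(0.85), predict(0.15))
```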
3.3.2 AdaBoost Classifier
In AdaBoost we assign (non-negative) weights to the points in the data set, normalized so that they form a distribution. In each iteration we generate a training set by sampling from the data using the weights, i.e. the data point (Xi, yi) is chosen with probability wi, where wi is the current weight of that data point; the training set is generated by such repeated independent sampling. After learning the current classifier, we increase the (relative) weights of the data points misclassified by it, generate a fresh training set using the modified weights, and so on. The final classifier is essentially a weighted majority vote of all the classifiers. The description of the algorithm, as in (Freund et al., 1995), is given below:
Input: n examples (X_1, y_1), ..., (X_n, y_n), with X_i ∈ H ⊆ R^d and y_i ∈ {-1, 1}.

1. Initialize: w_i(1) = 1/n for all i. Each data point starts with equal weight, so
   when data points are sampled from this probability distribution, each one is
   equally likely to appear in the training set.

2. Assume there are M classifiers in the ensemble. For m = 1 to M:

   (a) Generate a training set by sampling with the weights w_i(m).

   (b) Learn a classifier h_m using this training set.

   (c) Let ξ_m = Σ_{i=1}^{n} w_i(m) I[y_i ≠ h_m(X_i)], where the indicator
       function is defined as

       I[y_i ≠ h_m(X_i)] = 1 if y_i ≠ h_m(X_i), and 0 otherwise,

       so ξ_m is the weighted error of the m-th classifier.

   (d) Set α_m = log((1 - ξ_m)/ξ_m), the computed hypothesis weight; α_m > 0
       because of the assumption that ξ_m < 0.5.

   (e) Update the weight distribution over the training set as

       w_i(m + 1) = w_i(m) exp(α_m I[y_i ≠ h_m(X_i)])

       and normalize the updated weights so that w(m + 1) is a distribution:

       w_i(m + 1) = w_i(m + 1) / Σ_j w_j(m + 1)

   end for

3. Output the final vote h(X) = sgn(Σ_{m=1}^{M} α_m h_m(X)), the sign of the
   weighted sum of all classifiers in the ensemble.
In the AdaBoost algorithm M is a parameter; because of the sampling with weights,
the procedure can be continued for an arbitrary number of iterations. The loss
function used in the AdaBoost algorithm is the exponential loss, defined for a
particular data point as exp(-y_i f(X_i)).
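The weight-update loop above can be sketched directly. The version below uses weighted fitting of decision stumps (via sample_weight) rather than resampling, a standard realization of step (a); the data are synthetic and illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Minimal AdaBoost loop following the steps above, with labels in {-1, +1}.
X, y = make_classification(n_samples=400, n_features=10, random_state=1)
y = 2 * y - 1                      # map {0, 1} -> {-1, +1}
n, M = len(y), 25
w = np.full(n, 1.0 / n)            # step 1: uniform initial weights
alphas, stumps = [], []
for m in range(M):
    h = DecisionTreeClassifier(max_depth=1, random_state=m)
    h.fit(X, y, sample_weight=w)   # steps (a)/(b): weighted weak classifier
    miss = (h.predict(X) != y)     # indicator I[y_i != h_m(X_i)]
    xi = np.dot(w, miss)           # step (c): weighted error
    if xi >= 0.5 or xi == 0:       # stop if the weak-learner assumption fails
        break
    alpha = np.log((1 - xi) / xi)  # step (d): hypothesis weight
    w = w * np.exp(alpha * miss)   # step (e): up-weight misclassified points
    w = w / w.sum()                # normalize to a distribution
    alphas.append(alpha)
    stumps.append(h)

# Final vote: sign of the weighted sum of the weak classifiers.
F = sum(a * h.predict(X) for a, h in zip(alphas, stumps))
pred = np.sign(F)
print((pred == y).mean())
```

In practice the same behaviour is available from scikit-learn's AdaBoostClassifier; the explicit loop is shown only to mirror the formulation above.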
3.3.3 Random Forest Classifiers
Random forests are a combination of tree predictors such that each tree depends
on the values of a random vector sampled independently and with the same dis-
tribution for all trees in the forest. The main difference between standard decision
trees and random forests is that in decision trees each node is split using the best
split among all variables, whereas in a random forest each node is split using the
best among a subset of predictors randomly chosen at that node. In a random
forest classifier, ntree bootstrap samples are drawn from the original data, and for
each bootstrap sample an unpruned classification decision tree is grown, with the
following modification: at each node, rather than choosing the best split among
all predictors, mtry of the predictors are randomly sampled and the best split is
chosen from among those variables. New data are predicted by aggregating the
predictions of the ntree trees (i.e., majority vote for classification). The algorithm,
as described in (Breiman, 2001), is as follows:
Input: n examples D = {(X_1, y_1), ..., (X_n, y_n)}, with X_i ∈ R^d, where D is
the whole dataset.

For i = 1, ..., B:

1. Choose a bootstrap sample D_i from D.

2. Construct a decision tree T_i from the bootstrap sample D_i such that at each
   node a random subset of m features is chosen and only splits on those features
   are considered.

Finally, given the test data X_t, take the majority vote of the trees for classification.
Here B is the number of bootstrap datasets generated from the original dataset D.
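A minimal sketch of the random forest, combined with the hard majority vote over the three ensembles described at the start of this section, might look as follows; scikit-learn is assumed, and the dataset and settings are illustrative rather than those of this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# B = n_estimators bootstrap-trained trees, each node splitting on a random
# subset of features (max_features plays the role of mtry).
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=0)

# Hard voting = majority vote of the three ensemble classifiers.
vote = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(random_state=0)),
                ("ada", AdaBoostClassifier(random_state=0)),
                ("rf", rf)],
    voting="hard")
vote.fit(X_train, y_train)
print(vote.score(X_test, y_test))
```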
3.4 Multilayer Feed Forward with Back Propagation using Word Embedding Approach
In this approach a word embedding framework was used to convert words to vectors,
followed by applying an MFN to classify the business event dataset. The gensim
module in Python was used to build the word embeddings, training the words with
the CBOW (continuous bag-of-words) or skip-gram model of the unsupervised
neural language model (Mikolov et al., 2013), where each word is assigned a
uniformly distributed (U[-1, 1]) 100- to 300-dimensional vector. Once vectors have
been initialized for each word using the word embedding, a window-based approach
converts the word vectors into a single global sentence vector. The obtained global
sentence vector is fed into an MFN with back-propagation for classification of the
sentences using a soft-max classifier. The implementation of the algorithm is as
follows:
1. Initialize each word in a sentence with a uniformly distributed (U[-1, 1])
   dense vector of 100 to 300 dimensions.

2. For the given set of words within a sentence, concatenate the word-embedding
   vectors to form a matrix for that particular sentence.

3. Choose an appropriate window size on the obtained matrix and correspond-
   ingly apply max-pooling based on the window size to obtain a global sentence
   vector.

4. Feed the obtained global sentence vectors into a multilayer feed forward
   network with back propagation using soft-max as the loss function. For
   regularization of the network and to avoid overfitting of the data, a dropout
   mechanism is adopted.
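Steps 1-3 above can be sketched as follows. Here the U[-1, 1] vectors are generated directly with NumPy for illustration, whereas in practice they would come from a trained word2vec model (e.g. via gensim); the toy vocabulary and sentence are hypothetical.

```python
import numpy as np

# Step 1: assign each vocabulary word a uniformly distributed U[-1, 1]
# dense vector (100-dimensional here; 100-300 in the text).
rng = np.random.default_rng(0)
dim = 100
vocab = {w: rng.uniform(-1, 1, dim) for w in
         "acme corp acquires beta systems for million dollars".split()}

def sentence_vector(sentence, window=3):
    # Step 2: stack word vectors into a (num_words, dim) sentence matrix.
    mat = np.stack([vocab[w] for w in sentence.split()])
    # Step 3: max-pool over sliding windows of rows, then over the windows,
    # yielding one fixed-length global sentence vector.
    pools = [mat[i:i + window].max(axis=0)
             for i in range(max(1, len(mat) - window + 1))]
    return np.max(pools, axis=0)

v = sentence_vector("acme corp acquires beta systems")
print(v.shape)  # fixed length regardless of sentence length
```

The resulting fixed-length vector is what would be fed to the MFN in step 4.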
3.5 Convolutional Neural Networks for Sentence Classification with Unsupervised Feature Vector Learning
In this model a simple CNN is trained with one layer of convolution on top of
word vectors obtained from an unsupervised neural language model (Kim, 2014).
These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google
News and are publicly available. Figure 3.1 describes the architecture of the CNN
for sentence modelling.

Figure 3.1: Architecture of the convolutional neural network for sentence
modelling (multichannel architecture)
Let N be the number of sentences in the vocabulary and n the number of words
in a particular sentence, and let x_i ∈ R^k be the k-dimensional word vector
corresponding to the i-th word in the sentence. A sentence of length n (padded
where necessary) is represented as

x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n,

where ⊕ is the concatenation operator. In general, let x_{i:i+j} refer to the
concatenation of the words x_i, x_{i+1}, ..., x_{i+j}. A convolution operation
involves a filter weight matrix w ∈ R^{h×k}, initialized with random uniformly
distributed values, which is applied to a window of h words of a particular sentence
to produce a new feature. For example, a feature c_i is generated from a window
of words x_{i:i+h-1} by

c_i = f(w · x_{i:i+h-1} + b).

Here b ∈ R is a bias term and f is a non-linear function such as the hyperbolic
tangent. This filter is applied to each possible window of words in the sentence
{x_{1:h}, x_{2:h+1}, ..., x_{n-h+1:n}} to produce a feature map

c = [c_1, c_2, ..., c_{n-h+1}],

with c ∈ R^{n-h+1}. We then apply a max-pooling operation over the feature
map and take the maximum value c* = max{c} as the feature corresponding to
this particular filter. The idea is to capture the most important feature, the one
with the highest value, for each feature map; this pooling scheme naturally deals
with variable sentence lengths. We have described the process by which one feature
is extracted from one filter. The model uses multiple filters (with varying window
sizes) to obtain multiple features. These features are also called unsupervised
features, because they are obtained by applying different filters with varying window
sizes. They form the penultimate layer and are passed to a fully connected soft-max
layer whose output is the probability distribution over labels.
To avoid overfitting of the CNN models, a drop-out mechanism is adopted.
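The convolution and max-over-time pooling for a single filter can be sketched numerically as follows; the sizes n, k and h are illustrative.

```python
import numpy as np

# One filter of the convolution layer described above, applied to a toy
# sentence matrix of n words with k-dimensional vectors.
rng = np.random.default_rng(0)
n, k, h = 7, 5, 3                 # sentence length, vector dim, window size
x = rng.uniform(-1, 1, (n, k))    # word vectors x_1 ... x_n (rows)
w = rng.uniform(-1, 1, (h, k))    # filter weight matrix w in R^{h x k}
b = 0.1                           # bias term

# c_i = f(w . x_{i:i+h-1} + b) with f = tanh, for every window of h words.
c = np.array([np.tanh(np.sum(w * x[i:i + h]) + b)
              for i in range(n - h + 1)])
c_star = c.max()                  # max-over-time pooling: one feature
print(c.shape)                    # feature map c in R^{n-h+1}
```

With multiple filters of varying window sizes, the pooled values c* form the feature vector passed to the soft-max layer.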
3.5.1 Variations in CNN sentence models
CNN-rand: Our baseline model where all words are randomly initialized and
then modified during training.
CNN-static: A model with pre-trained vectors from word2vec. All words in-
cluding the unknown ones that are randomly initialized are kept static and only
the other parameters of the model are learned. Initializing word vectors with those
57. 34
obtained from an unsupervised neural language model is a popular method to im-
prove performance in the absence of a large supervised training set. We use the
publicly available word2vec vectors that were trained on 100 billion words from
Google news. The vectors have dimensionality of 300 and were trained using the
continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in
the set of pre-trained words are initialized randomly.
Chapter 4
Results and Discussions
In this chapter we discuss the results obtained from the machine learning
algorithms applied in our work:

1. the semi-supervised learning approach using naive Bayes with expectation
   maximization, together with active learning using QBC, to increase the number
   of labeled data points;

2. the ensemble classifiers and the MFN and CNN models used to classify the
   obtained business data.

Described below are the results and analysis of these algorithms.
4.1 Semi-supervised Learning Implementation using Naive Bayes with Expectation Maximization

Initially we had few data points that were labeled in a supervised manner. To
formulate and solve the problem as a business event classification problem, our
primary objective was to increase the number of labeled data points.
In accordance with the algorithm of semi-supervised learning using a naive Bayes
classifier with expectation maximization explained in Section 3.1, the following are
the results in the three domains of acquisition, vendor-supplier and job events, with
the training data taken as 30%, 40% and 50% of the whole dataset and the rest of
the pool as test data.
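A minimal sketch of the naive Bayes with expectation-maximization loop of Section 3.1 is given below. It is illustrative only: the data are synthetic, and GaussianNB stands in for the text-oriented naive Bayes model actually used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the event data: a small labeled pool plus unlabeled data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:120] = True               # small initially-labeled pool

nb = GaussianNB().fit(X[labeled], y[labeled])
for _ in range(5):                 # EM iterations
    # E-step: posterior probabilities (soft labels) for the unlabeled data.
    proba = nb.predict_proba(X[~labeled])
    pseudo = proba.argmax(axis=1)
    # M-step: refit on labeled data plus pseudo-labeled data, weighting the
    # unlabeled points by the classifier's confidence.
    X_all = np.vstack([X[labeled], X[~labeled]])
    y_all = np.concatenate([y[labeled], pseudo])
    weights = np.concatenate([np.ones(labeled.sum()), proba.max(axis=1)])
    nb = GaussianNB().fit(X_all, y_all, sample_weight=weights)

print(nb.score(X, y))
```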
4.1.1 Results and Analysis of Vendor-Supplier Event Data
Vendor-supplier data points labeled in a supervised manner numbered 754. Stated
below are some of the observations made for a large pool of unlabeled test data,
obtained by varying the split between training and test data. Table 4.1 and
Figure 4.1 show the variation of accuracies and F-scores for 30%, 40% and 50% of
the data used for training, with the corresponding remaining part as test data.
Figures 4.2, 4.4 and 4.6 display the confusion matrices for these splits; the
confusion matrix gives insight into the numbers of true positives, true negatives,
false positives and false negatives. Figures 4.3, 4.5 and 4.7 display the ROC curves
for the same splits.
Analysis: We observe an increase in accuracy and F-score as the number of
training data points increases, which is as expected. However, the accuracies
increase by larger amounts than the F-scores, because true negatives are more
numerous than true positives. The confusion matrix plots show slight variations in
the numbers of true positives and true negatives as the number of training data
points is increased. The ROC curves show an increase in TPR and in the area
under the curve as the number of training data points increases.
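The relationship between these metrics can be made concrete from the confusion matrix counts. The numbers below are illustrative, not the thesis data; they show how a surplus of true negatives flatters accuracy while leaving the F-score behind.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy predictions dominated by true negatives (illustrative numbers only).
y_true = [0] * 80 + [1] * 20
y_pred = [0] * 75 + [1] * 5 + [1] * 12 + [0] * 8

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
acc = accuracy_score(y_true, y_pred)   # (tp + tn) / total: boosted by tn
f1 = f1_score(y_true, y_pred)          # depends only on the positive class
print(tn, fp, fn, tp)                  # → 75 5 8 12
print(round(acc, 2), round(f1, 2))     # → 0.87 0.65
```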
Table 4.1: Variation in accuracies and F-scores in semi-supervised learning
using naive Bayes for vendor-supplier data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.5597     0.5915    testing data = 527, training data = 227
40                  0.7434     0.65      testing data = 454, training data = 300
50                  0.7765     0.674     testing data = 376, training data = 376
Figure 4.1: Variations in Accuracies and F1-scores for Vendor-supplier data
using Naive-Bayes, semi-supervised technique
Figure 4.2: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for VNSP
Figure 4.3: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for VNSP
Figure 4.4: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for VNSP
Figure 4.5: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for VNSP
Figure 4.6: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for VNSP
Figure 4.7: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for VNSP
4.1.2 Results and Analysis for Job Event Data
Job event data points labeled in a supervised manner numbered 2810. Stated
below are some of the observations made for a large pool of unlabeled test data,
obtained by varying the split between training and test data. Table 4.2 and
Figure 4.8 show the variation of accuracies and F-scores for 30%, 40% and 50% of
the data used for training, with the corresponding remaining part as test data.
Figures 4.9, 4.11 and 4.13 display the confusion matrices for these splits; the
confusion matrix gives insight into the numbers of true positives, true negatives,
false positives and false negatives. Figures 4.10, 4.12 and 4.14 display the ROC
curves for the same splits.
Analysis: As the number of training data points increases, we observe an increase
in accuracy and F-score. However, there is a vast difference between the accuracies
and the F-scores, because the number of true negatives is very high compared with
the small number of true positives, which is clearly visible in the confusion matrix
plots. The ROC curves show an increase in TPR and in the area under the curve
as the number of training data points increases.
Table 4.2: Variation in accuracies and F-scores in semi-supervised learning
using naive Bayes for job event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.7483     0.4444    testing data = 1967, training data = 842
40                  0.7544     0.4863    testing data = 1686, training data = 1123
50                  0.8014     0.52      testing data = 1405, training data = 1404
Figure 4.8: Variations in Accuracies and F1-scores for Job event data using
Naive-Bayes, semi-supervised technique
Figure 4.9: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for JOB
Figure 4.10: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for JOB
Figure 4.11: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for JOB
Figure 4.12: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for JOB
Figure 4.13: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for JOB
Figure 4.14: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for JOB
4.1.3 Results and Analysis for Acquisition Event Data
Acquisition event data points labeled in a supervised manner numbered 1380.
Stated below are some of the observations made for a large pool of unlabeled test
data, obtained by varying the split between training and test data. Table 4.3 and
Figure 4.15 show the variation of accuracies and F-scores for 30%, 40% and 50% of
the data used for training, with the corresponding remaining part as test data.
Figures 4.16, 4.18 and 4.20 display the confusion matrices for these splits; the
confusion matrix gives insight into the numbers of true positives, true negatives,
false positives and false negatives. Figures 4.17, 4.19 and 4.21 display the ROC
curves for the same splits.
Analysis: There is an increase in accuracy and F-score as the number of training
data points increases. The F-scores increase slightly more than the accuracies,
because true positives are more numerous than true negatives, so the classifier is
biased towards the positive class. Consequently the number of false positives is
higher in this scenario, which is clearly visible in the confusion matrix plots. The
ROC curves show an increase in TPR and in the area under the curve as the
number of training data points increases.
Table 4.3: Variation in accuracies and F-scores in semi-supervised learning
using naive Bayes for acquisition event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.7929     0.8178    testing data = 966, training data = 413
40                  0.7989     0.82      testing data = 828, training data = 521
50                  0.8057     0.8241    testing data = 689, training data = 690
Figure 4.15: Variations in Accuracies and F1-scores for Acquisition event data
using Naive-Bayes, semi-supervised technique
Figure 4.16: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for Acquisition
Figure 4.17: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for Acquisition
Figure 4.18: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for Acquisition
Figure 4.19: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for Acquisition
Figure 4.20: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for Acquisition
Figure 4.21: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for Acquisition
4.2 Active Learning Implementation by Query by Committee Approach

In accordance with the algorithm of active learning explained in Section 3.1.2, the
following are some of the results in the three domains of acquisition, vendor-supplier
and job events, with the training data taken as 30%, 40% and 50% of the whole
dataset and prediction of the test data done by majority voting of three ensemble
classifiers, the gradient boosting, AdaBoost and random forest classifiers (i.e., the
query by committee approach).
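A sketch of the query-by-committee idea follows, under the assumption that the committee is exactly these three ensembles: the pool is labeled by majority vote, and points on which the committee disagrees are the natural candidates to query. The dataset and settings are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split

# Labeled seed set plus an unlabeled pool (synthetic, illustrative).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_lab, X_pool, y_lab, _ = train_test_split(X, y, train_size=0.3,
                                           random_state=0)

committee = [GradientBoostingClassifier(random_state=0),
             AdaBoostClassifier(random_state=0),
             RandomForestClassifier(random_state=0)]
votes = np.array([c.fit(X_lab, y_lab).predict(X_pool) for c in committee])

majority = (votes.sum(axis=0) >= 2).astype(int)   # majority vote of three
disagree = ~np.all(votes == votes[0], axis=0)     # candidate points to query
print(majority.shape, int(disagree.sum()))
```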
4.2.1 Results and Analysis for Vendor-Supplier Event Data
Vendor-supplier data points labeled in a supervised manner numbered 754.
Following are some of the observations made for a large pool of unlabeled test data,
obtained by varying the split between training and test data. Table 4.4 and
Figure 4.22 show the variation of accuracies and F-scores for 30%, 40% and 50% of
the data used for training, with the corresponding remaining part as test data.
Figures 4.23, 4.25 and 4.27 display the confusion matrices for these splits; the
confusion matrix gives insight into the numbers of true positives, true negatives,
false positives and false negatives. Figures 4.24, 4.26 and 4.28 display the ROC
curves for the same splits.
Analysis: We observe an increase in accuracy and F-score as the number of
training data points increases. However, the accuracies increase by larger amounts
than the F-scores, because true negatives are more numerous than true positives.
This method performs better than the semi-supervised naive Bayes classifier. The
confusion matrix plots show slight variations in the numbers of true positives and
true negatives as the number of training data points is increased. The ROC curves
show an increase in TPR and in the area under the curve as the number of training
data points increases.
Table 4.4: Variation in accuracies and F-scores using active learning for
vendor-supplier event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.842      0.7348    testing data = 529, training data = 225
40                  0.84       0.7352    testing data = 454, training data = 300
50                  0.8643     0.76      testing data = 376, training data = 376
Figure 4.22: Variations in Accuracies and F1-scores for Vendor-supplier data
using Active learning
Figure 4.23: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier
Figure 4.24: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier
Figure 4.25: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier
Figure 4.26: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier
Figure 4.27: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier
Figure 4.28: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier
4.2.2 Results and Analysis for Job Event Data
Job event data points labeled in a supervised manner numbered 2809. Following
are some of the observations made for a large pool of unlabeled test data, obtained
by varying the split between training and test data. Table 4.5 and Figure 4.29
show the variation of accuracies and F-scores for 30%, 40% and 50% of the data
used for training, with the corresponding remaining part as test data. Figures 4.30,
4.32 and 4.34 display the confusion matrices for these splits; the confusion matrix
gives insight into the numbers of true positives, true negatives, false positives and
false negatives. Figures 4.31, 4.33 and 4.35 display the ROC curves for the same
splits.
Analysis: As the number of training data points increases, we observe an increase
in accuracy and F-score. However, there is a vast difference between the accuracies
and the F-scores, because the number of true negatives is very high compared
with the small number of true positives, which is clearly visible in the confusion
matrix plots. The ROC curves show an increase in TPR and in the area under
the curve as the number of training data points increases. The performance of
this method is better than that of the semi-supervised naive Bayes classifier, as is
clearly visible from our results.
Table 4.5: Variation in accuracies and F-scores using active learning for job
event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.9054     0.6204    testing data = 1967, training data = 842
40                  0.9116     0.6558    testing data = 1686, training data = 1123
50                  0.9216     0.6758    testing data = 1405, training data = 1404
Figure 4.29: Variations in Accuracies and F1-scores for Job event data using
Active learning
Figure 4.30: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for Job
Figure 4.31: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for Job
Figure 4.32: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for Job
Figure 4.33: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for Job
Figure 4.34: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for Job
Figure 4.35: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for Job
4.2.3 Results and Analysis for Acquisition Event Data
Acquisition event data points labeled in a supervised manner numbered 1380.
Following are some of the observations made for a large pool of unlabeled test
data, obtained by varying the split between training and test data. Table 4.6 and
Figure 4.36 show the variation of accuracies and F-scores for 30%, 40% and 50% of
the data used for training, with the corresponding remaining part as test data.
Figures 4.37, 4.39 and 4.41 display the confusion matrices for these splits; the
confusion matrix gives insight into the numbers of true positives, true negatives,
false positives and false negatives. Figures 4.38, 4.40 and 4.42 display the ROC
curves for the same splits.
Analysis: There is an increase in accuracy and F-score as the number of training
data points increases, and the increases in F-score are comparable to the increases
in accuracy. The confusion matrix plots show that the numbers of true positives
and true negatives are nearly equal. The ROC curves show an increase in TPR
and in the area under the curve as the number of training data points increases.
This method shows a slight improvement in accuracy compared with the
semi-supervised naive Bayes classifier.