BUSINESS EVENT RECOGNITION FROM
ONLINE NEWS ARTICLES
A Project Report
submitted by
MOHAN KASHYAP.P
in partial fulfillment of the requirements
for the award of the degree of
MASTER OF TECHNOLOGY
IN
MACHINE LEARNING AND COMPUTING
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF SPACE SCIENCE AND TECHNOLOGY
Thiruvananthapuram - 695547
May 2015
CERTIFICATE
This is to certify that the thesis titled 'Business Event Recognition From
Online News Articles', submitted by Mohan Kashyap.P, to the Indian Insti-
tute of Space Science and Technology, Thiruvananthapuram, for the award of the
degree of MASTER OF TECHNOLOGY, is a bonafide record of the research
work done by him under my supervision. The contents of this thesis, in full or in
parts, have not been submitted to any other Institute or University for the award
of any degree or diploma.
Dr. Sumitra.S
Supervisor
Department of Mathematics
IIST
Dr. Raju K. George
Head of Department
Department of Mathematics
IIST
Place: Thiruvananthapuram
May, 2015
DECLARATION
I declare that this thesis titled 'Business Event Recognition From Online
News Articles' submitted in fulfillment of the Degree of MASTER OF TECH-
NOLOGY is a record of original work carried out by me under the supervision
of Dr. Sumitra S., and has not formed the basis for the award of any degree,
diploma, associateship, fellowship or other titles in this or any other Institution
or University of higher learning. In keeping with the ethical practice in reporting
scientific information, due acknowledgements have been made wherever the find-
ings of others have been cited.
Mohan Kashyap.P
SC13M055
Place: Thiruvananthapuram
May, 2015
Abstract
Business Event Recognition From Online News Articles deals with the extraction
of news text related to business events in three domains: Acquisition,
Vendor-Supplier and Job. The developed automated model for recognizing business
events predicts whether an online news article contains a business event
or not. For developing the model, data related to business events was
crawled from the web. Since the manual labeling of data was expensive,
semi-supervised learning techniques were used to obtain the required labeled
data, and the tagged data was then pre-processed using natural language
processing techniques. Vectorizers were then applied to convert the text into
numeric form using the bag-of-words, word-embedding and word2vec approaches.
Finally, ensemble classifiers with the bag-of-words approach and CNNs
(Convolutional Neural Networks) with the word-embedding and word2vec approaches
were applied to the business event datasets, and the results obtained were
found to be promising.
Acknowledgements
First and foremost I thank God, The Almighty, for all his blessings. I would
like to express my deepest gratitude to my research supervisor and teacher,
Dr. Sumitra S., for her continuous guidance and motivation, without which this
research work would never have been possible. I cannot thank her enough for her
limitless patience and dedication in correcting my thesis report and molding it
into its present form. Interactions with her taught me the importance of small
things that are often overlooked and exposed me to the art of approaching a
problem from different angles. These lessons will be invaluable for me in my
career and personal life ahead.
Besides my supervisor, I would like to thank my mentor, Mr. Mahesh C.R. of
TataaTsu Idea Labs, for allowing me to carry out my thesis work in their
organization. I would like to express my deepest gratitude to him for helping me
realize my abilities and for building my confidence to solve challenging problems
in Machine Learning, turning my theoretical understanding into practical
real-time implementation. My sincere thanks also go to all the faculty members
of the Mathematics Department for their encouragement, questions and insightful
comments.
I am grateful to my project lead at TataaTsu Idea Labs, Mr. Vinay, and his
team for helping me in the implementation of the project work.
I would like to thank Research Scholar Shiju S. Nair for extending his
'any time' help and for providing additional inputs to my work.
Last but not the least, I would like to thank my classmates and friends at IIST
for their company and for all the fun we had during the two years of M.Tech.
Hailing from an electrical background and not being that strong in coding, my
special thanks go to Praveen and Sailesh for constantly supporting and guiding
me through two years of machine learning, and to Arvindh for inspiring me in
certain aspects of the course work.
Finally, I would like to thank my parents and my sister for their care, love
and support throughout my life.
Contents
Acknowledgements iv
List of Figures vi
List of Tables ix
List of Abbreviations xii
1 Introduction 1
1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Natural Language Processing . . . . . . . . . . . . . . . . . 3
Information Extraction and Retrieval: . . . . . . . . . 4
Named Entity Recognition: . . . . . . . . . . . . . . . 4
Parts Of Speech Tagging: . . . . . . . . . . . . . . . . 4
1.2.2 Text to Numeric Conversion . . . . . . . . . . . . . . . . . . 4
1.2.3 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3.1 Semi-supervised Technique . . . . . . . . . . . . . . 5
1.2.3.2 Active Learning . . . . . . . . . . . . . . . . . . . . 6
Uncertainty sampling: . . . . . . . . . . . . . . . . . . 6
Query by the committee: . . . . . . . . . . . . . . . . 6
Expected model change: . . . . . . . . . . . . . . . . 7
Expected error reduction: . . . . . . . . . . . . . . . . 7
Variance reduction: . . . . . . . . . . . . . . . . . . . 7
1.2.4 Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4.1 Ensemble Classifiers . . . . . . . . . . . . . . . . . 7
Bagging: . . . . . . . . . . . . . . . . . . . . . . . . . 8
Boosting: . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 Convolutional Neural Network . . . . . . . . . . . . . . . . . 8
Convolutional Layer: . . . . . . . . . . . . . . . . . . 8
Activation Function: . . . . . . . . . . . . . . . . . . 9
Pooling layer: . . . . . . . . . . . . . . . . . . . . . . 9
Fully connected layer: . . . . . . . . . . . . . . . . . . 9
Loss layer: . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.6 Measures used for Analysing the Results: . . . . . . . . . . . 9
1.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Part-of-speech (POS) pattern of the phrase: . . . . . 11
Extraction of rhetorical signal features: . . . . . . . . 11
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Second Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Third Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.3 Fourth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Fifth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Data Extraction, Data Pre-processing and Feature Engineering 14
2.1 Crawling of Data from Web . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Labeling of Extracted Data . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1.1 Acquisition Data Description . . . . . . . . . . . . 15
Acquisition event: . . . . . . . . . . . . . . . . . . . . 15
Non Acquisition event: . . . . . . . . . . . . . . . . . 15
2.2.1.2 Vendor-Supplier Data Description . . . . . . . . . . 15
Vendor-Supplier event: . . . . . . . . . . . . . . . . . 15
Non Vendor-Supplier event: . . . . . . . . . . . . . . 16
2.2.1.3 Job Data Description . . . . . . . . . . . . . . . . . 16
Job event: . . . . . . . . . . . . . . . . . . . . . . . . 16
Non Job event: . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Type 1 Features . . . . . . . . . . . . . . . . . . . . . . . . . 17
Noun, Noun-phrases and Proper nouns: . . . . . . . . 17
Example of Noun-phrase: . . . . . . . . . . . . . . . . 17
Word-Capital: . . . . . . . . . . . . . . . . . . . . . . 17
Example of Capital words: . . . . . . . . . . . . . . . 17
Parts of speech tag pattern: . . . . . . . . . . . . . . 17
Example of POS tag pattern Adj-Noun format: . . . 18
2.3.2 Type 2 Features . . . . . . . . . . . . . . . . . . . . . . . . . 18
Organization Name: . . . . . . . . . . . . . . . . . . . 18
Example of Organization names: . . . . . . . . . . . . 18
Organization references: . . . . . . . . . . . . . . . . 18
Examples of Organization references: . . . . . . . . . 18
Location: . . . . . . . . . . . . . . . . . . . . . . . . . 18
Example of location as feature . . . . . . . . . . . . . 18
Persons: . . . . . . . . . . . . . . . . . . . . . . . . . 18
Example of Persons: . . . . . . . . . . . . . . . . . . . 18
2.3.3 Type 3 Features . . . . . . . . . . . . . . . . . . . . . . . . . 19
Continuation: . . . . . . . . . . . . . . . . . . . . . . 19
Change of direction: . . . . . . . . . . . . . . . . . . . 19
Sequence: . . . . . . . . . . . . . . . . . . . . . . . . 19
Illustration: . . . . . . . . . . . . . . . . . . . . . . . 19
Emphasis: . . . . . . . . . . . . . . . . . . . . . . . . 19
Cause, condition or result : . . . . . . . . . . . . . . . 19
Spatial signals: . . . . . . . . . . . . . . . . . . . . . 19
Comparison or contrast: . . . . . . . . . . . . . . . . 19
Conclusion: . . . . . . . . . . . . . . . . . . . . . . . 19
Fuzz: . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Description of Vectorizers . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Count Vectorizers . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1.1 Example of Count Vectorizer . . . . . . . . . . . . 20
2.4.2 Term Frequency and Inverse Document Frequency . . . . . . 21
2.4.2.1 Formulation of Term Frequency and Inverse Doc-
ument Frequency . . . . . . . . . . . . . . . . . . . 21
Term-Frequency formulation: . . . . . . . . . . . . . . 21
Inverse Document Frequency formulation: . . . . . . . 21
2.4.2.2 Description of Combination of TF and IDF . . . . 22
2.4.2.3 Example of TF-IDF Vectorizer . . . . . . . . . . . 22
3 Machine Learning Algorithms Used For Analysis Of Business
Event Recognition 24
3.1 Semi-supervised Learning using Naive Bayes Classifier with Expectation-
Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Active Learning using Ensemble classifiers with QBC approach . . . 25
3.2.1 Query by committee . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Ensemble Models for Classification of Business Events using Bag-
Of-Words Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Gradient Boosting Classifier . . . . . . . . . . . . . . . . . . 26
3.3.2 AdaBoost Classifier . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Random Forest Classifiers . . . . . . . . . . . . . . . . . . . 29
3.4 Multilayer Feed Forward with Back Propagation using word em-
bedding approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Convolutional Neural Networks for Sentence Classification with un-
supervised feature vector learning . . . . . . . . . . . . . . . . . . . 30
3.5.1 Variations in CNN sentence models . . . . . . . . . . . . . . 32
CNN-rand: . . . . . . . . . . . . . . . . . . . . . . . . 32
CNN-static: . . . . . . . . . . . . . . . . . . . . . . . 32
4 Results and Discussions 34
4.1 Semi-supervised Learning Implementation using Naive Bayes with
Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Results and Analysis of Vendor-Supplier Event Data . . . . 35
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Results and Analysis for Job Event Data . . . . . . . . . . . 40
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Result and Analysis for Acquisition Event Data . . . . . . . 45
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Active Learning implementation by Query by committee approach . 50
4.2.1 Results and Analysis for Vendor-Supplier Event Data . . . . 50
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Result and Analysis for Job Event Data . . . . . . . . . . . 55
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Result and Analysis for Acquisition Event Data . . . . . . . 60
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Comparison of Semi-supervised techniques and Active learning ap-
proach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Results of Ensemble Classifiers with different Parameter tuning . . 65
4.4.1 Analysis for vendor-supplier event Data using 100 estimators
within the ensemble as the parameter . . . . . . . . . . . . . 65
4.4.2 Analysis for Job event Data using 100 estimators within the
ensemble as the parameter . . . . . . . . . . . . . . . . . . . 68
4.4.3 Analysis for Acquisition event Data using 100 estimators
within the ensemble as the parameter . . . . . . . . . . . . . 71
4.4.4 Analysis for Vendor-Supplier event Data using 500 estima-
tors within the ensemble as the parameter . . . . . . . . . . 74
4.4.5 Analysis for Job event Data using 500 estimators within the
ensemble as the parameter . . . . . . . . . . . . . . . . . . . 77
4.4.6 Analysis for Acquisition event Data using 500 estimators
within the ensemble as the parameter . . . . . . . . . . . . . 80
4.5 Final Accuracies and F-score estimates for the model . . . . . . . . 83
4.5.1 Final Analysis of Vendor-Supplier Dataset . . . . . . . . . . 84
4.5.2 Final Analysis of Job Dataset . . . . . . . . . . . . . . . . . 85
4.5.3 Final Analysis of Acquisition Dataset . . . . . . . . . . . . . 87
4.6 Results obtained for MFN with Word Embedding . . . . . . . . . . 90
4.7 Results obtained for Convolutional Neural Networks . . . . . . . . . 90
4.7.1 Analysis for Vendor-Supplier Data using CNN-rand and CNN-
word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.2 Analysis for Acquisition Data using CNN-rand and CNN-
word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.3 Analysis for Job using CNN-rand and CNN-word2vec Model 92
4.8 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5 Conclusions and Future work 95
5.1 Challenges Encountered in Business Event Recognition . . . . . . . 95
5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography 99
List of Figures
3.1 The Image describes the architecture for Convolutional Neural Net-
work with Sentence Modelling for multichannel architecture . . . . 31
4.1 Variations in Accuracies and F1-scores for Vendor-supplier data us-
ing Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . 36
4.2 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for VNSP . . . . . . . . . . . . . . . . . 37
4.3 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for VNSP . . . . . . . . . . . . . . . . . 38
4.5 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 38
4.6 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for VNSP . . . . . . . . . . . . . . . . . 39
4.7 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 39
4.8 Variations in Accuracies and F1-scores for Job event data using
Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . . . 41
4.9 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for JOB . . . . . . . . . . . . . . . . . . 42
4.10 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 42
4.11 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for JOB . . . . . . . . . . . . . . . . . . 43
4.12 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 43
4.13 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for JOB . . . . . . . . . . . . . . . . . . 44
4.14 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 44
4.15 Variations in Accuracies and F1-scores for Acquisition event data
using Naive-Bayes, semi-supervised technique . . . . . . . . . . . . 46
4.16 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Acquisition . . . . . . . . . . . . . . 47
4.17 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 47
4.18 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Acquisition . . . . . . . . . . . . . . 48
4.19 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 48
4.20 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Acquisition . . . . . . . . . . . . . . 49
4.21 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Acquisition . . . . . . . . . . . . . . . . . . . 49
4.22 variations in Accuracies and F1-scores for Vendor-supplier data us-
ing Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.23 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Vendor-supplier . . . . . . . . . . . . 52
4.24 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Vendor-supplier . . . . . . . . . . . . . . . . 52
4.25 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Vendor-Supplier . . . . . . . . . . . 53
4.26 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Vendor-supplier . . . . . . . . . . . . . . . . 53
4.27 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Vendor-supplier . . . . . . . . . . . . 54
4.28 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Vendor-supplier . . . . . . . . . . . . . . . . 54
4.29 Variations in Accuracies and F1-scores for Job event data using
Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.30 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Job . . . . . . . . . . . . . . . . . . 57
4.31 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 57
4.32 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Job . . . . . . . . . . . . . . . . . . 58
4.33 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 58
4.34 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 59
4.35 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 59
4.36 Variations in Accuracies and F1-scores for Acquisition event data
using Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.37 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Acquisition . . . . . . . . . . . . . . 62
4.38 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 62
4.39 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Acquisition . . . . . . . . . . . . . . 63
4.40 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 63
4.41 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 64
4.42 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 64
4.43 variations in Accuracies and F1-scores for Vendor-supplier data for
5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 67
4.44 Confusion matrix for Vendor-supplier with number of estimators as
100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.45 Roc curve for Vendor-supplier with number of estimators as 100 . . 68
4.46 Variations in Accuracies and F1-scores for Job data for 5-fold using
3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.47 Confusion matrix for Job with number of estimators as 100 . . . . . 71
4.48 Roc curve for Job with number of estimators as 100 . . . . . . . 71
4.49 Variations in Accuracies and F1-scores for Acquisition data for 5-
fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 73
4.50 Confusion matrix for Acquisition with number of estimators as 100 74
4.51 Roc curve for Acquisition with number of estimators as 100 . . . 74
4.52 Variations in Accuracies and F1-scores for Vendor-supplier data for
5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 76
4.53 Confusion matrix for Vendor-supplier with number of estimators as
500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.54 Roc curve for Vendor-supplier with number of estimators as 500 . . 77
4.55 Variations in Accuracies and F1-scores for Job data for 5-fold using
3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.56 Confusion matrix for Job with number of estimators as 500 . . . . . 80
4.57 Roc curve for Job with number of estimators as 500 . . . . . . . 80
4.58 Variations in Accuracies and F1-scores for Acquisition data for 5-
fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 82
4.59 Confusion matrix for Acquisition with number of estimators as 500 83
4.60 Roc curve for Acquisition with number of estimators as 500 . . . 83
4.61 Variations in Accuracies and F1-scores for Vendor-supplier data for
whole data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.62 Variations in Accuracies and F1-scores for Job data for whole data
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.63 Variations in Accuracies and F1-scores for Acquisition data 5-folds
accuracy variations for whole data set . . . . . . . . . . . . . . . . . 89
4.64 CNN-rand and CNN-word2vec models for Vendor-supplier on whole
data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.65 CNN-rand and CNN-word2vec models for Acquisition on whole
data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.66 CNN-rand and CNN-word2vec models for Job on whole data set
with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
List of Tables
1.1 Recognition of Named-Event Passages in News Articles and its ap-
plication to our work . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 The words and their counts in the sentence1 . . . . . . . . . . . . . 22
2.2 The words and their counts in the sentence2 . . . . . . . . . . . . . 22
4.1 Variation in accuracies and F-scores in Semi-supervised learning
using naive Bayes for vendor-supplier data . . . . . . . . . . . . . . 36
4.2 Variation in accuracies and F-scores in Semi-supervised learning
using naive Bayes for Job event data . . . . . . . . . . . . . . . . . 41
4.3 Variation in accuracies and F-scores in Semi-supervised learning
using naive Bayes for Acquisition event data . . . . . . . . . . . . . 46
4.4 Variation in accuracies and F-scores using Active Learning for Vendor-
supplier event data . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Variation in accuracies and F-scores using Active Learning for Job
event data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Variation in accuracies and F-scores using Active Learning for Ac-
quisition event data . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 100 in vendor-supplier data set 66
4.8 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 100 in vendor-supplier data . . . . 66
4.9 Variation in accuracies and F-scores for random forest classifier for
number of parameter estimate as 100 in vendor-supplier data . . . . 66
4.10 Variation in test score for accuracy and F-score with Vendor-supplier
data using voting of three ensemble classifiers with number of esti-
mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.11 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 100 in Job data set . . . . . . 69
4.12 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 100 in Job data set . . . . . . . . 69
4.13 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 100 in Job data set . . . . . . . . 69
4.14 variation in test score for accuracy and F-score with Job data using
voting of three ensemble classifiers with number of estimators as 100 70
4.15 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 100 in Acquisition data set . . 72
4.16 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 100 in Acquisition data set . . . . 72
4.17 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 100 in Acquisition data set . . . . 72
4.18 Variation in test score for accuracy and F-score with Acquisition
data using voting of three ensemble classifiers with number of esti-
mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.19 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 500 in vendor-supplier data set 75
4.20 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 500 in vendor-supplier data set . . 75
4.21 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 500 in vendor-supplier data set . . 75
4.22 Variation in test score for accuracy and F-score with vendor-supplier
data using voting of three ensemble classifiers with number of esti-
mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.23 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 500 in Job data set . . . . . . 78
4.24 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 500 in Job data set . . . . . . . . 78
4.25 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 500 in Job data set . . . . . . . . 78
4.26 variation in test score for accuracy and F-score with Job data using
voting of three ensemble classifiers with number of estimators as 500 79
4.27 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 500 in Acquisition data set . . 81
4.28 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 500 in Acquisition data set . . . . 81
4.29 Variation in accuracies and F-scores for Ada boosting classifier for
number of parameter estimate as 500 in Acquisition data set . . . . 81
4.30 Variation in test score for accuracy and F-score with Acquisition
data using voting of three ensemble classifiers with number of esti-
mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.31 Variation in accuracies and F-scores for Gradient Boosting classifier
for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . 84
4.32 Variation in accuracies and F-scores for Ada Boosting classifier for
whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 84
4.33 Variation in accuracies and F-scores for Random forest classifier for
whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 85
4.34 Variation in accuracies and F-scores for Gradient Boosting classifier
for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.35 Variation in accuracies and F-scores for Ada Boosting classifier for
whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.36 Variation in accuracies and F-scores for Random forest classifier for
whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.37 Variation in accuracies and F-scores for Gradient Boosting classifier
for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . 88
4.38 Variation in accuracies and F-scores for Random forest classifier for
whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 88
4.39 Variation in accuracies and F-scores for Ada boosting classifier for
whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 89
4.40 Variation in test score for MFN with word embedding . . . . . . . . 90
4.41 Variation in accuracies and F-scores for CNN-rand and CNN-word2vec
models for Vendor-supplier on whole data set . . . . . . . . . . . . . 91
4.42 Variation in accuracies and F-scores for CNN-rand and CNN-word2vec
models for Acquisition on whole data set . . . . . . . . . . . . . . . 92
4.43 Variation in accuracies and F-scores for CNN-rand and CNN-word2vec
models for Job on whole data set . . . . . . . . . . . . . . . . . . 93
List of Abbreviations
POS Parts Of Speech
NLTK Natural Language Toolkit
QBC Query By Committee
NLP Natural Language Processing
IE Information Extraction
IR Information Retrieval
NER Named Entity Recognizer
ML Machine Learning
CNN Convolutional Neural Network
MFN Multilayer Feed Forward Network
TF Term Frequency
IDF Inverse Document Frequency
CBOW Continuous Bag Of Words
ROC Receiver Operating Characteristic
TPR True Positive Rate
FPR False Positive Rate
TP True Positives
FP False Positives
TN True Negatives
FN False Negatives
Chapter 1
Introduction
Textual information present on the web is unstructured, and extracting useful
information from it for a specific purpose is tedious and challenging. Over the
years, therefore, various methods have been proposed for the extraction of useful
text. Text mining is the domain that deals with the process of deriving
high-quality information from unstructured text. The goal of text mining is
essentially to convert unstructured text into structured data and thereby extract
useful information by applying techniques of natural language processing (NLP)
and pattern recognition.
The concept of manual text mining was first introduced in the mid-1980s (Hobbs
et al., 1982). Over the past decade, technological advancements in this field have
been significant, with automated approaches being built for the extraction and
analysis of text. Text mining is composed of five major components: information
retrieval, data mining, machine learning, statistics and computational linguistics.
The applications of text mining span various domains, which include:
(a) Named entity recognition, which deals with the identification of named text
features such as people, organizations and locations (Sang et al., 2003).
(b) Recognition of pattern-identified entities, which deals with the extraction of
features such as telephone numbers, e-mail addresses and built-in database
quantities that can be discerned using regular expressions or other pattern
matches (Nadeau et al., 2007). (c) Co-reference resolution, which deals with the
identification of noun phrases and of other terms, such as her, him, it and their,
that refer to these nouns (Soon et al., 2001). (d) Sentiment analysis, which
includes extracting various forms of users' intent information such as sentiment,
opinion, mood and emotion; text analytics techniques are helpful in the analysis
of sentiment at the level of different topics (Pang et al., 2008). (e) Spam
detection, which deals with the classification of e-mail as spam or not, based on
the application of statistical machine learning and text mining techniques (Rowe
et al., 2007). (f) News analytics, which deals with the extraction of vital news
or information content of interest to the end user. (g) Business event
recognition from online news articles.
Business Event Recognition From Online News Articles captures
semantic signals and identifies patterns in unstructured text to extract business
events in three main domains, namely acquisition, vendor-supplier and job events,
from online news articles. An acquisition business event news pattern is, in
general, of the context of one organization acquiring another organization. The
keywords used in the acquisition business event scenario are acquire, buy, sell,
sold, bought, take-over, purchase and merger. A vendor-supplier business event
news pattern is, in general, of the context of an organization obtaining a
contract from another organization to perform a certain task for that
organization. The keywords used in the vendor-supplier business event scenario
are contract, procure, sign, implement, select, award, work, agreement, deploy,
provide, team, collaborate, deliver and joint. A job business event news pattern
is, in general, of the context of appointments of persons to prominent positions,
and the hiring and firing of people within an organization.
Our thesis deals with the development of an automated model for business
event recognition from online news articles. For developing this automated
model, data was crawled from websites such as Reuters News, businesswireindia.com
and prnewswire.com. Since manual labeling of the data was expensive, the
gathered data was subjected to semi-supervised learning techniques and active
learning methods to obtain more tagged event data in the domains of acquisition,
vendor-supplier and job. The tagged data thus obtained was pre-processed using
natural language processing techniques. Further, for the conversion of text to
numeric form, the bag-of-words, word-embedding and word2vec approaches were
used. The final analysis on the business event datasets was performed using
ensemble classifiers with the bag-of-words approach and convolutional neural
networks with the word-embedding and word2vec approaches.
1.1 Model Architecture
Given a set of online articles or documents that are of interest to the end user,
our automated model must predict, as the class output, whether a given sentence
contains a business event related to acquisition, vendor-supplier or job events.
If the automated model predicts a sentence as a business event, it should also
give out additional information describing the event, such as the entities
involved in that particular event, like organizations and people. Providing
such additional information helps the end user make better decisions with
quicker insights.
Business events happen around the world on a daily basis. An organization, as a
competitor, would like to understand the business analytics of other
organizations. The development of an automated approach for identifying such
business events helps in better decision making, increases efficiency and helps
an organization develop better business strategies.
1.2 Methods
The sections given below describe the methods used in our work.
1.2.1 Natural Language Processing
The concepts of information extraction and information retrieval in our work deal
with the extraction and retrieval of business news containing business event
sentences from online news articles. The concepts of part-of-speech (POS)
tagging and named entity recognition (NER) are used as part of feature
engineering in our work. The pattern of POS tags is essential for extracting
useful semantic features, and NER is useful for extracting entity-type features
like organizations, persons and locations, which form an integral part of any
business event. The framework for our project is formed by the concepts of
information extraction (IE) and information retrieval (IR). Discussed below are
the methods of information extraction and retrieval, named entity recognition
(NER) and parts-of-speech (POS) tagging, which form the baseline for the
implementation of natural language processing techniques (Liddy, 2001).
Information Extraction and Retrieval: Information extraction and retrieval
deal with searching for the required text, extracting semantic information from
it, and storing the retrieved information in a particular form in a database.
Named Entity Recognition: Named entity recognition deals with extracting
from a text document a set of typed entities such as people, places and
organizations.
Parts Of Speech Tagging: The pattern of POS tags forms an important
set of features for any NLP-related task. Extraction of proper semantic features
is possible using the pattern of POS tags.
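As an illustration of these building blocks, the short sketch below (an assumed
example using the NLTK toolkit, not the exact thesis code) shows how POS tags and
named entities can be obtained for a business news sentence; the standard tagger
and chunker models (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words)
are assumed to be downloaded.

    import nltk

    # Hypothetical example sentence; the POS tags feed the pattern features and the
    # NE chunker yields entity-type features such as ORGANIZATION or PERSON.
    sentence = "IBM today announced a definitive agreement to acquire Silverpop."

    tokens = nltk.word_tokenize(sentence)     # split the sentence into word tokens
    pos_tags = nltk.pos_tag(tokens)           # part-of-speech tag each token
    tree = nltk.ne_chunk(pos_tags)            # group tagged tokens into named entities

    # Collect (entity text, entity type) pairs, e.g. ('IBM', 'ORGANIZATION').
    entities = [(" ".join(w for w, t in st.leaves()), st.label())
                for st in tree.subtrees() if st.label() != "S"]
    print(pos_tags)
    print(entities)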
1.2.2 Text to Numeric Conversion
The conversion of words to vectors was implemented using the bag-of-words and
word-embedding approaches. Described below is an overview of these concepts.
In the bag-of-words approach, a piece of text or a sentence of a document is
represented as a bag (multiset) of its words, disregarding grammar and word
order but keeping the multiplicity of the words intact (Harris, 1954). Word
embedding is the collective name for a set of language modeling and feature
learning techniques in natural language processing in which words from sentences
are mapped to vectors of real numbers in a space of low dimension relative to the
vocabulary size (Tomas Mikolov et al., 2013).
One of the major disadvantages of the bag-of-words approach is that it fails to
capture the semantics of a particular word within a sentence, because it converts
words to vectors disregarding grammar and order. Consider the following sentence
where the bag-of-words approach fails:
After drawing money from the Bank, Ravi went to the river Bank.
In the bag-of-words approach there is no distinction between the financial 'Bank'
and the river 'Bank'. This problem of capturing the semantics of a word is, to a
certain extent, overcome by word embedding. In word embedding, each word is
represented by a 100- to 300-dimensional dense vector, initialised uniformly at
random (i.e. U[-1,1]). Word embedding with a window approach captures semantics
to a certain extent.
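As a small illustration of the bag-of-words conversion, the sketch below assumes
scikit-learn's CountVectorizer as one common way to realise it; the example
sentences are hypothetical.

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = [
        "IBM announced an agreement to acquire Silverpop",
        "Tri-State signs agreement with NextEra Energy Resources",
    ]

    vectorizer = CountVectorizer()             # builds the vocabulary and counts word occurrences
    X = vectorizer.fit_transform(sentences)    # sparse matrix: one row per sentence, one column per word

    print(vectorizer.vocabulary_)              # mapping from word to column index
    print(X.toarray())                         # word counts; grammar and word order are discarded

Each sentence becomes a count vector over the vocabulary, which is exactly why
the two occurrences of 'Bank' above cannot be distinguished by this approach.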
1.2.3 Data Labeling
The extracted data points labeled in a supervised manner were few in number. The
sections below describe the semi-supervised technique and the active learning
methods used to obtain more labeled data.
1.2.3.1 Semi-supervised Technique
The naive Bayes classifier forms an integral part of the implementation of
semi-supervised learning with expectation maximization, which was used to
increase the number of labeled data points (Kamal Nigam et al., 2006). Discussed
below is an overview of the naive Bayes classifier.
Naive Bayes classifiers are probabilistic classifiers based on Bayes' theorem.
The naive Bayes classifier assumes that the features are conditionally
independent of one another given the class. The modeling of a naive Bayes
classifier is described as follows:
Given an input feature vector x = (x_1, x_2, \ldots, x_n)^T, we need to determine
which class this feature vector belongs to, i.e. p(Y_k | x_1, x_2, \ldots, x_n)
for each of the K classes, where Y_k is the output variable for the k-th class.
Using Bayes' theorem, the above probability expression can be rewritten as

p(Y_k | x) = \frac{p(Y_k) \, p(x | Y_k)}{p(x)}

where
p(Y_k) is the prior probability of class k,
p(x | Y_k) is the class-conditional likelihood, and
p(x) is the probability of observing that particular data point (the evidence).

The naive Bayes framework uses the maximum a posteriori (MAP) rule to pick the
most probable output class, where the posterior probability is proportional to
the prior times the likelihood. The naive Bayes classifier therefore assigns the
label \hat{y} = Y_k according to the MAP rule, and the classifier prediction is
given by

\hat{y} = \arg\max_{k \in \{1, \ldots, K\}} p(Y_k) \prod_{i=1}^{n} p(x_i | Y_k).
In text mining, the classifier used is the multinomial naive Bayes classifier
with the bag-of-words approach.
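The sketch below outlines the semi-supervised idea in code. It is a rough,
hard-label variant of the EM loop (assumed toy sentences and scikit-learn's
MultinomialNB, not the thesis implementation): the classifier is trained on the
small labeled set, labels the unlabeled pool (E-step) and is retrained on the
combined data (M-step).

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled_text = ["IBM agrees to acquire Silverpop",
                    "Carlyle invests across four segments"]
    labels = np.array([1, 0])                  # 1 = acquisition event, 0 = not an event
    unlabeled_text = ["Oracle announced a deal to buy a software firm",
                      "The company reported quarterly revenue growth"]

    vectorizer = CountVectorizer()
    X_lab = vectorizer.fit_transform(labeled_text).toarray()
    X_unlab = vectorizer.transform(unlabeled_text).toarray()

    clf = MultinomialNB()
    clf.fit(X_lab, labels)                     # initial model from the small labeled set

    for _ in range(10):                        # EM-style iterations
        pseudo = clf.predict(X_unlab)          # E-step: label the unlabeled sentences
        X_all = np.vstack([X_lab, X_unlab])    # M-step: retrain on labeled + pseudo-labeled data
        y_all = np.concatenate([labels, pseudo])
        clf.fit(X_all, y_all)

    print(clf.predict(X_unlab))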
1.2.3.2 Active Learning
Active learning with the query-by-committee approach using ensemble classifiers
was implemented as part of our work to increase the number of labeled data
points (Abe and Mamitsuka, 1998). Discussed below is the concept of active
learning.
Active learning is a special case of semi-supervised machine learning in which
a learning algorithm is able to interactively query the user (or some other
information source) to obtain the desired outputs at new data points. There are
situations in which unlabeled data is abundant but manual labeling is expensive.
In such a scenario, learning algorithms can actively query the user for labels.
This type of iterative supervised learning is called active learning. Since the
learner chooses the examples, the number of examples needed to learn a concept
can often be much lower than the number required in normal supervised learning.
Discussed below are the query strategies used for selecting the most informative
data points in active learning.
Uncertainty sampling: Deals with querying the user for the labels of those
points about which the current model is least certain, i.e. the points whose
predicted label distribution has the maximum entropy.
Query by committee: A committee of classifiers is trained on the current
labeled data points. A vote is then taken on the labels predicted by the
classifiers, and the user is queried for the labels of the points on which the
classifiers disagree the most (a code sketch is given after this list).
Expected model change: Labeling of the data points which would result in the
greatest change to the current model.
Expected error reduction: Labeling of those points which would most reduce the
current model's generalization error.
Variance reduction: Labeling of those points which minimize the output variance
of the current model the most; in an SVM, for example, these are the points
closest to the margin hyperplane.
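Following the query-by-committee strategy referred to above, the sketch below
(with assumed placeholder data; the committee members mirror the ensemble
classifiers discussed later) measures committee disagreement by the vote entropy
and selects the pool points to be sent to the user for labeling.

    import numpy as np
    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier)

    # Placeholder labeled set and unlabeled pool standing in for vectorized sentences.
    rng = np.random.RandomState(0)
    X_lab = rng.rand(20, 50)
    y_lab = np.array([0, 1] * 10)
    X_pool = rng.rand(200, 50)

    committee = [RandomForestClassifier(n_estimators=100),
                 AdaBoostClassifier(n_estimators=100),
                 GradientBoostingClassifier(n_estimators=100)]
    for clf in committee:
        clf.fit(X_lab, y_lab)

    votes = np.array([clf.predict(X_pool) for clf in committee])   # one row of votes per member
    p_pos = votes.mean(axis=0)                                      # fraction of positive votes per point
    vote_entropy = -(p_pos * np.log2(p_pos + 1e-12)
                     + (1 - p_pos) * np.log2(1 - p_pos + 1e-12))
    query_idx = np.argsort(vote_entropy)[-10:]  # the 10 most-disagreed-upon points to query the user
    print(query_idx)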
1.2.4 Learning Classifiers
The classifiers used in our work were ensemble classifiers and convolutional
neural networks (CNNs). The sections below give a basic overview of the concepts
required to understand the ensemble methods and the CNN implemented in our work.
1.2.4.1 Ensemble Classifiers
The random forest classifier implemented in our work (Breiman, 2001) is derived
from the concept of the bootstrap aggregation technique. The gradient boosting
classifier (Friedman et al., 2001) and the AdaBoost classifier (Freund et al.,
1995) implemented in our work are derived from the boosting technique. Discussed
below are the concepts of ensembles with bagging and boosting.
Ensembles are the concept of combining classifiers so that the performance of the
combined classifier is better than the performance of each individual classifier.
There are two different kinds of ensemble methods in practice: one is bagging,
also called bootstrap aggregation, and the other is boosting.
Bagging: In bagging, a single classifier is learnt from a subset of the training
data at each instance. From a training set of M instances, it is possible to draw
M random instances using a uniform distribution. The M samples drawn at each
instant are learnt by a classifier, and this process is repeated several times.
Since the sampling is done with replacement, there are chances that certain data
points get picked twice and certain data points do not appear at all within a
subset of the original training dataset. A classifier is learnt on such a subset
of the training data in each cycle. The final prediction is based on taking a
vote of the classifiers learnt on the different generated datasets.
Boosting: In boosting, a single classifier or different classifiers are learnt
using a subset of data at each instance. The boosting technique analyses the
performance of the classifier learnt at each instant and forces the classifier to
learn those training instances which were incorrectly classified. Instead of
choosing the M training instances randomly using a uniform distribution, the
training instances are chosen in such a manner as to favour the instances that
have not been accurately learned by the classifier. The final prediction is
performed by taking the weighted vote of the classifiers learnt over the various
instances.
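A minimal sketch of combining the three ensemble classifiers used in this work by
majority vote over bag-of-words features is given below; the sentences and labels
are hypothetical, and scikit-learn's VotingClassifier is assumed as one way of
taking the vote.

    from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                  GradientBoostingClassifier, VotingClassifier)
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["IBM agrees to acquire Silverpop",
             "Carlyle invests across four segments",
             "Tri-State signs agreement with NextEra",
             "Quarterly revenue grew last year"]
    y = [1, 0, 1, 0]                           # 1 = business event, 0 = not an event

    X = CountVectorizer().fit_transform(texts).toarray()
    ensemble = VotingClassifier(estimators=[
        ("rf", RandomForestClassifier(n_estimators=100)),      # bagging-based
        ("ada", AdaBoostClassifier(n_estimators=100)),         # boosting-based
        ("gb", GradientBoostingClassifier(n_estimators=100)),  # boosting-based
    ], voting="hard")                          # majority vote of the three members
    ensemble.fit(X, y)
    print(ensemble.predict(X))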
1.2.5 Convolutional Neural Network
Convolutional neural networks for sentence modelling, trained with a softmax
classifier, were implemented in our work (Yoon Kim, 2014). Discussed below is an
overview of a generalized convolutional neural network and the softmax classifier.
A convolutional neural network is a type of feed-forward neural network whose
architecture consists of the following main layers: the convolutional layer, the
pooling layer, the fully connected layer and the loss layer. The stacking of
these layers forms the full conv-net architecture.
Convolutional Layer: In conventional image processing, convolution with Sobel or
Prewitt filters is useful for detecting features of the image such as edges and
corners. In a convolutional neural network, by comparison, the parameters of each
convolutional kernel (i.e. each filter) are trained by the back-propagation
algorithm. There are many convolution kernels in each layer, and each kernel is
replicated over the entire image with the same parameters. The function of the
convolution operators is to extract different features of the input.
Activation Function: The activation functions used in convolutional neural
networks are the hyperbolic tangent function f(x) = tanh(x), the ReLU function
f(x) = max(0, x) and the sigmoid function f(x) = \frac{1}{1 + \exp(-x)}.
Pooling layer: This layer captures the most important feature by performing the
max operation on the obtained feature map vector. All such max features together
form the penultimate layer.
Fully connected layer: Finally, after several convolutional and max-pooling
layers, the high-level reasoning in the neural network is done via fully
connected layers. A fully connected layer takes all the neurons in the previous
layer (be it fully connected, pooling or convolutional) and connects them to
every one of its own neurons. Fully connected layers are no longer spatially
located (they can be visualized as one-dimensional), so there can be no
convolutional layers after a fully connected layer.
Loss layer: After the fully connected layer, a softmax classifier is present at
the output layer with a softmax loss function to predict the probabilistic
labels.
The softmax classifier is obtained from the softmax function: for a sample input
vector x, the predicted probability of the output y belonging to the j-th class
among K classes is given by

P(y = j \mid x) = \frac{e^{x^T w_j}}{\sum_{k=1}^{K} e^{x^T w_k}}
1.2.6 Measures used for Analysing the Results:
The performance measures used for our results and analysis are described as
follows (Powers et al., 2007); a short code illustration is given after this
list.
1. F-score: The F-score is a measure used in information retrieval for evaluating
sentence classification performance; it takes the true positives into account
but not the true negatives. The F-score is defined as:
F_1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}
2. Confusion matrix: The performance of any classification algorithm can be
visualized by a specific table layout which is called the confusion matrix.
Each column of the confusion matrix represents the instances in a predicted
class, while each row of the confusion matrix represents the instances in an
actual class.
3. ROC curve: It is a plot of TPR against FPR. The TPR describes the proportion
of true positive results among all positive samples, while the FPR describes
the proportion of incorrect positive results among all negative samples. The
area under the ROC curve is a measure of accuracy.
4. Accuracy: Accuracy of a classification problem is defined as:
accuracy = \frac{TP + TN}{P + N}
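A short illustration of these measures on hypothetical labels, assuming the
scikit-learn metrics module, is given below.

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 f1_score, roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0]                   # hypothetical actual labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                   # hypothetical predicted labels
    y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.7, 0.6, 0.1]  # predicted probability of the positive class

    print(confusion_matrix(y_true, y_pred))             # rows: actual class, columns: predicted class
    print("accuracy:", accuracy_score(y_true, y_pred))  # (TP + TN) / (P + N)
    print("F1-score:", f1_score(y_true, y_pred))        # 2*TP / (2*TP + FP + FN)
    print("ROC AUC:", roc_auc_score(y_true, y_score))   # area under the ROC curve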
1.3 Related Works
The paper closest to our work on Business Event Recognition From Online News
Articles is Recognition of Named-Event Passages in News Articles (Luis Marujo
et al., 2012). That paper describes a method for finding named events in the
violent behaviour and business domains, identifying the specific passages of
news articles that contain information about such events, and reports
preliminary evaluation results using NLP techniques and ML algorithms. Table 1.1
describes the paper Recognition of Named-Event Passages in News Articles and its
application to our work.
As part of the feature engineering used in our work, we have adopted some of the
feature engineering techniques from (Luis Marujo et al., 2012). The following are
the features used in our work with reference to this paper.
Part-of-speech (POS) pattern of the phrase: (e.g., < noun >, < adj, noun >
, < adj, adj, noun >, etc.) Noun and noun phrases are the most common pattern
observed in key phrases containing named events, verb and verb phrases are less
frequent, and key phrases made of the remaining POS tags are rare.
Extraction of rhetorical signal features: These are a set of features which
capture the reader's attention in news events: continuation, change of
direction, sequence, illustration, emphasis, cause, condition or result, spatial
signals, comparison or contrast, conclusion and fuzz.
1.4 Thesis Outline
The second chapter deals with the extraction and understanding of the business
event data, the third chapter with the application of machine-learning algorithms
on the obtained data, the fourth chapter with the results and analysis on the
business event datasets, and the fifth chapter with the conclusions of our work.
1.4.1 Second Chapter
This chapter deals with the extraction of business event data from the web,
followed by pre-processing of the data, the application of feature engineering
on the obtained data and, finally, the conversion of the data into vectors for
applying machine-learning algorithms.
1.4.2 Third Chapter
This chapter deals with applying semi-supervised techniques on the data to
increase the number of data points, and with the algorithms of the different
ensemble classifiers and the CNN (convolutional neural network).
Table 1.1: Recognition of Named-Event Passages in News Articles and its
application to our work
Recognition of named-event passages from news articles:
1. Deals with automatically identifying multi-sentence passages in a news
article that describe named events. Specifically, the paper focuses on ten
event types: five in the violent behaviour domain (terrorism, suicide bombing,
sex abuse, armed clashes and street protests) and five in the business domain
(management changes, mergers and acquisitions, strikes, legal troubles and
bankruptcy).
2. The problem is solved as a multiclass classification problem, for which the
training data was obtained through crowd-sourcing on Amazon Mechanical Turk to
label the data points as events or non-events. Ensemble classifiers are then
used for the classification of these sentences for each event, and passages
containing the same events are finally aggregated using HMM methods.

Business event recognition from online news articles:
1. Our work, derived from Recognition of Named-Event Passages in News Articles,
focuses exclusively on identifying business events in the domains of merger and
acquisition, vendor-supplier and job events.
2. The problem in our case is solved as a binary classification for each of the
three domains (merger and acquisition, vendor-supplier and job), describing
whether a sentence contains that particular event or not. The procedure used in
our case differs in that we label a few data points in a supervised way and then
increase the number of labeled data points by applying semi-supervised
techniques. Finally, ensemble classifiers and convolutional neural networks are
applied for the classification of the labeled data points.
1.4.3 Fourth Chapter
This chapter deals with the results and analysis of the applied machine-learning
techniques, which include the semi-supervised learning analysis, the ensemble
classifier analysis and the analysis of the convolutional neural networks.
1.4.4 Fifth Chapter
This chapter deals with the challenges encountered while carrying out the
project, the conclusions of the project and its future scope.
1.5 Thesis Contribution
Our work focuses on business event recognition in three domains: acquisition,
vendor-supplier and job. The whole process of identifying business event news
exclusively in these three domains, using the knowledge of machine learning and
NLP techniques, is the main contribution of our work.
Chapter 2
Data Extraction, Data Pre-processing and Feature Engineering
The initial step in business event recognition is the extraction of business news
and the labeling of a small portion of the extracted data, so that the task can
be formulated as a machine learning problem. The methods for extracting data from
the web and labeling some of the extracted data are described in the following
sections.
2.1 Crawling of Data from Web
There are several methods to crawl data from the web; one such method is
described in this section. Every website has its own HTML structure, so separate
crawling logic had to be written to extract text data from the different
websites. The Python modules used for data extraction are Beautiful Soup and
urllib. For extracting the data for our study, information was taken from
particular websites such as businesswireindia.com, prnewswire.com and Reuters
News.
The Python language framework was used in our work. The urllib module in Python
is used to fetch the particular set of pages which have to be accessed on the
web. The Beautiful Soup module in Python uses the HTML structure and finds the
contents present within each page in the form of the title, subtitle and
description corresponding to each content block. Finally, the extracted title,
subtitle and body contents are stored in text-file format.
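A minimal sketch of this crawling step is given below; the URL and the HTML tag
names are hypothetical placeholders, since every website needs its own
extraction logic.

    import urllib.request
    from bs4 import BeautifulSoup

    url = "https://www.example.com/news/some-press-release"   # placeholder article URL
    html = urllib.request.urlopen(url).read()                 # fetch the raw HTML of the page

    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1").get_text(strip=True)              # headline block (tag name assumed)
    body = " ".join(p.get_text(strip=True) for p in soup.find_all("p"))  # body paragraphs

    with open("article.txt", "w") as f:                       # store the extracted content as a text file
        f.write(title + "\n" + body)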
2.2 Labeling of Extracted Data
Since the business events are in the form of sentences, the text document
obtained as raw text from web crawling is split up into sentences using the
Natural Language Toolkit (NLTK) sentence tokenizer. Some of the sentences were
then labeled for the three classes (merger and acquisition, vendor-supplier and
job), describing whether each sentence is a business event or not.
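Splitting the crawled raw text into candidate sentences can be sketched as
follows (assuming the NLTK punkt models are available; the file name follows the
crawling sketch above):

    import nltk

    raw_text = open("article.txt").read()          # raw text obtained from web crawling
    sentences = nltk.sent_tokenize(raw_text)       # NLTK sentence tokenizer
    for s in sentences:
        print(s)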
2.2.1 Data Description
Stated below are illustrations of data describing a business event or a
non-event in the three classes of acquisition, vendor-supplier and job.
2.2.1.1 Acquisition Data Description
Acquisition event: ARMONK, N.Y., April 10, 2014 /PRNewswire/ – IBM
(NYSE: IBM) today announced a definitive agreement to acquire Silverpop, a
privately held software company based in Atlanta, GA.
Non Acquisition event: Carlyle invests across four segments Corporate Pri-
vate Equity Real Assets Global Market Strategies and Solutions in Africa Asia
Australia Europe the Middle East North America and South America.
2.2.1.2 Vendor-Supplier Data Description
Vendor-Supplier event: Tri-State signs agreement with NextEra Energy Re-
sources for new wind facility in eastern Colorado under the Director Jack stone;
WESTMINSTER, Colo., Feb. 5, 2014 /PRNewswire/ – Tri-State Generation and
Transmission Association, Inc. announced that it has entered into a 25-year agree-
ment with a subsidiary of NextEra Energy Resources, LLC for a 150 megawatt
wind power generating facility to be constructed in eastern Colorado, in the ser-
vice territory of Tri-State member cooperative K. C. Electric Association (Hugo,
Colo.).
Non Vendor-Supplier event: The implementation of the DebMed GMS elec-
tronic hand hygiene monitoring system is a clear demonstration of Meadows Re-
gional Medical Center’s commitment to patient safety, and we are excited to
partner with such a forward-thinking organization that is focused on providing
a state-of-the-art patient environment, said Heather McLarney, vice president of
marketing, DebMed.
2.2.1.3 Job Data Description
Job event: In a note to investors, analysts at FBR Capital Markets said the
appointment of Nadella as Director of the company was a ”safe pick” compared
to choosing an outsider.
Non Job event: This partnership is an example of steps we are taking to simplify
and improve the Tactile Medical order process, said Cathy Gendreau, Business
Director.
2.2.2 Data Pre-processing
The business event sentences extracted as raw text were cleansed by removing special
characters and stop-words, which include words such as the, and, an, etc. The stop-words
are common to both the positive and the negative class, and hence, to enhance the
difference between the two classes, we removed them. The NLTK module in Python was
used for this pre-processing of the data.
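A sketch of this cleansing step with NLTK is given below; the example sentence is taken
from the acquisition data above.

# Remove special characters and English stop-words from a sentence.
import re
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))

def preprocess(sentence):
    cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)   # drop special characters
    tokens = [w for w in cleaned.split() if w.lower() not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("IBM today announced a definitive agreement to acquire Silverpop."))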
2.3 Feature Engineering
To build hand-crafted features, we had to observe the extracted unstructured data
and recognize patterns, so that useful features could be extracted. The features
extracted are described below; examples for the corresponding features are taken
with reference to the vendor-supplier event in (2.2.1.2).
2.3.1 Type 1 Features
Shallow semantic features record the pattern and semantics of the data and consist of
the following features (Luismaurijo et al., 2012).
Noun, Noun-phrases and Proper nouns: Entities form an integral part of
business event sentences, so noun phrases and proper nouns are common in sen-
tences containing business events. Using the NLTK parts-of-speech tagger, noun
phrases, nouns and proper nouns were extracted from each sentence.
Example of Noun-phrase: Title agreement Next Era Energy wind facility
eastern Colorado WESTMINSTER Colo. Feb. Generation Transmission Associa-
tion Inc. agreement subsidiary NextEra Energy LLC megawatt wind power facility
eastern Colorado service territory member K. C. Electric Association Hugo Colo.
Word-Capital: If a capital letter is present in a sentence containing a business event,
there is a higher chance of organizations, locations and persons being present in the
sentence, which in turn are entity-like features that enhance event recognition.
Example of Capital words: WESTMINSTER, LLC, K.C. Here WESTMINSTER is a
location and K.C. is an organization, an illustration of entity features obtained from
capital words.
Parts of speech tag pattern: Patterns of parts-of-speech tags such as adjective-noun
(an adjective followed by a noun) and adjective-adjective-noun (two adjectives followed
by a noun) are good sets of features for event recognition. Adjectives are used to
describe a noun, so there is a higher chance of finding this kind of pattern in a business
event sentence. Nouns and noun phrases are the most common pattern observed in key
phrases of business event sentences, verbs and verb phrases are less frequent, and key
phrases made of the remaining POS tags are rare.
Example of POS tag pattern in Adj-Noun format: new wind 25-year agreement
Tri-State member; here, for instance, new and 25-year are adjectives and agreement
is a noun.
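A sketch of how these Type 1 features can be extracted with the NLTK part-of-speech
tagger is shown below; the tag names follow the Penn Treebank convention used by
NLTK, and the example sentence is illustrative.

# Extract nouns, proper nouns, capitalised words and adjective-noun patterns.
import nltk
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def type1_features(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    nouns = [w for w, t in tagged if t.startswith("NN")]            # NN, NNS, NNP, NNPS
    proper_nouns = [w for w, t in tagged if t in ("NNP", "NNPS")]
    capitals = [w for w, _ in tagged if w[:1].isupper()]
    adj_noun = [(tagged[i][0], tagged[i + 1][0])                    # adjective followed by noun
                for i in range(len(tagged) - 1)
                if tagged[i][1].startswith("JJ") and tagged[i + 1][1].startswith("NN")]
    return {"nouns": nouns, "proper_nouns": proper_nouns,
            "capitals": capitals, "adj_noun": adj_noun}

print(type1_features("Tri-State signs a new agreement with NextEra Energy Resources."))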
2.3.2 Type 2 Features
Entity type features capture the entities present in the business event sentence.
Described below are some of these features.
Organization Name: Organization names are usually present in sentences containing
business events and often give additional insight as features for event recognition.
Example of Organization names: Tri-state Tri-State Generation and Trans-
mission Association, NextEra Energy Resources.
Organization references: References to organization entities present in the business
event sentences are also taken as features.
Examples of Organization references: K. C. Electric Association
Location: Location is an important entity feature, giving more insight into the
description of business events.
Example of location as feature : WESTMINSTER Colo. Colorado
Persons: There is a higher chance of a person or a group of people being present
in sentences that contain business events, so persons are used as features to
enhance business event recognition.
Example of Persons: Jack stone
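A sketch of how these entity features can be obtained with NLTK's named-entity chunker
is given below; the ORGANIZATION, PERSON and GPE/LOCATION labels are NLTK's own tags,
and the example sentence is illustrative.

# Extract organization, person and location entities from a sentence.
import nltk
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

def type2_features(sentence):
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
    entities = {"ORGANIZATION": [], "PERSON": [], "LOCATION": []}
    for subtree in tree.subtrees():
        label = "LOCATION" if subtree.label() in ("GPE", "LOCATION") else subtree.label()
        if label in entities:
            entities[label].append(" ".join(word for word, tag in subtree.leaves()))
    return entities

print(type2_features("Tri-State Generation signed an agreement with NextEra Energy in Colorado."))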
2.3.3 Type 3 Features
Rhetorical features: These are semantic signals that capture the reader's attention in
business event sentences; the following signal categories are identified in the literature,
as described in (Luismaurijo et al., 2012).
Continuation: There are more ideas to come e.g.: moreover, furthermore, in
addition, another.
Change of direction: There is a change of topic e.g.: in spite of, nevertheless,
the opposite, on the contrary.
Sequence: There is an order in the presentation of ideas e.g.: in first place, next,
into.
Illustration: Gives an example e.g.: to illustrate, in the same way as, for in-
stance, for example.
Emphasis: Increases the relevance of an idea these are the most important sig-
nals e.g.: it all boils down to, the most substantial issue, should be noted, the crux
of the matter, more than anything else.
Cause, condition or result : There is a conditional or modification coming to
following idea e.g.: if, because, resulting from.
Spatial signals: Denote locations e.g.: in front of, between, adjacent, west, east,
north, south, beyond.
Comparison or contrast: Comparison of two ideas e.g.: analogous to, better,
less than, less, like, either.
Conclusion: Ending the introduction of the idea and may have special impor-
tance e.g.: in summary, from this we see, last of all, hence, finally.
Fuzz: There is an idea that is not clear e.g.: looks like, seems like, alleged,
maybe, probably, sort of.
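One simple way to turn these rhetorical signals into features is a binary indicator per
category, marking whether any of its cue phrases appears in the sentence; a sketch with
abbreviated cue lists is shown below.

# Binary rhetorical-signal features based on cue-phrase lookup.
RHETORICAL_CUES = {
    "continuation": ["moreover", "furthermore", "in addition", "another"],
    "change_of_direction": ["in spite of", "nevertheless", "on the contrary"],
    "sequence": ["in first place", "next"],
    "illustration": ["for instance", "for example", "to illustrate"],
    "emphasis": ["should be noted", "the crux of the matter"],
    "cause_condition_result": ["if", "because", "resulting from"],
    "spatial": ["in front of", "between", "adjacent", "west", "east"],
    "comparison": ["analogous to", "better", "less than", "like", "either"],
    "conclusion": ["in summary", "hence", "finally"],
    "fuzz": ["looks like", "seems like", "alleged", "maybe", "probably"],
}

def rhetorical_features(sentence):
    text = sentence.lower()
    return {name: int(any(cue in text for cue in cues))
            for name, cues in RHETORICAL_CUES.items()}

print(rhetorical_features("Moreover, the company announced, for example, a new wind facility."))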
2.4 Description of Vectorizers
All the features extracted from a given sentence have to be converted into vectors
using vectorizers such as the count vectorizer and the TF-IDF vectorizer. The method
used to convert words to vectors is the bag-of-words approach. The two vectorizers
based on this approach are described below.
2.4.1 Count Vectorizers
This module uses the counts of the words present within a sentence and converts the
sentence into a vector by building a dictionary for the word-to-vector conversion
(Harris, 1954). An illustrative example of the count vectorizer is described below.
2.4.1.1 Example of Count Vectorizer
Consider the following two sentences.
a) John likes to watch movies. Mary likes movies too.
b) John also likes to watch football games.
Based on the above two sentences dictionary is constructed as follows:
{ John:1 , likes:2 , to:3 , watch:4 , movies:5 , also:6 , football:7 , games:8 , Mary:9
, too:10 }
The dictionary constructed has 10 distinct words. Using the indexes of the dictio-
nary, each sentence is represented by a 10-entry vector:
sentence1 : [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
sentence2 : [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
where each entry of the vectors refers to count of the corresponding entry in the
dictionary (this is also the histogram representation). For example, in the first
vector (which represents sentence 1), the first two entries are [1,2]. The first entry
corresponds to the word John which is the first word in the dictionary, and its
value is 1 because John appears in the first sentence 1 time. Similarly the second
entry corresponds to the word likes which is the second word in the dictionary,
and its value is 2 because likes appears in the first sentence 2 times. This vector
representation does not preserve the order of the words in the original sentences.
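The same bag-of-words conversion can be reproduced with scikit-learn's CountVectorizer,
as sketched below; the exact column order depends on the vocabulary ordering chosen by
the library.

# Bag-of-words count vectors for the two example sentences.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["John likes to watch movies. Mary likes movies too.",
             "John also likes to watch football games."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)   # sparse matrix, one row per sentence
print(vectorizer.vocabulary_)             # the learned dictionary (word -> column index)
print(X.toarray())                        # the count vectors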
2.4.2 Term Frequency and Inverse Document Frequency
Term frequency and inverse document frequency describe the importance of a particular
word in a document or sentence, relative to a collection of documents (Manning et al., 2008).
Term Frequency (TF) is defined as the number of occurrences of a particular word
within a document.
Inverse Document Frequency (IDF) is based on the number of documents containing
the particular word.
For the analysis in our work using tf-idf with the bag-of-words approach, we treat each
sentence as a document.
Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic
that is intended to reflect how important a particular word is to a sentence within a
collection of sentences.
2.4.2.1 Formulation of Term Frequency and Inverse Document Fre-
quency
Term frequency formulation: The term frequency tf(t, d) describes the number of times
that term t occurs in the sentence d. Two formulations of term frequency are described
below:
a) Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise.
b) Logarithmically scaled frequency: tf(t, d) = 1 + log tf(t, d) if t occurs in d and 0
otherwise.
Inverse Document Frequency formulation: Inverse document frequency is a measure of how
much information a particular word provides in a sentence, in comparison with the
collection of sentences under consideration; it measures whether the term is common or
rare across the whole collection of sentences. Mathematically it is described as follows:
idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )
where N is the total number of sentences in the collection and |{d ∈ D : t ∈ d}| is the
number of sentences d in which the term t appears (i.e., tf(t, d) ≠ 0). If the term does
not occur in any sentence, this leads to a division by zero, so it is common to adjust
the denominator to 1 + |{d ∈ D : t ∈ d}|.
2.4.2.2 Description of Combination of TF and IDF
Then tf-idf is calculated as tf-idf(t, d, D) = tf(t, d) × idf(t, D).
A high tf-idf weight is reached by a high term frequency in the given sentence and a low
document frequency of the term across the whole collection of sentences; the weights
hence tend to filter out common terms.
2.4.2.3 Example of TF-IDF Vectorizer
Consider term frequency tables (2.1) and (2.2) for a collection consisting of only
two sentences, as listed below.
Table 2.1: The words and their counts in sentence1

Term     Term count
this     1
is       1
a        2
sample   1

Table 2.2: The words and their counts in sentence2

Term      Term count
this      1
is        1
another   2
example   3
The calculation of tf-idf for the term this in sentence1 is performed as follows. Term
frequency, in its basic form, is just the count that we look up in the appropriate table;
in this case it is one for the term this in sentence1. The IDF for the term this is
given by
idf(this, D) = log( N / |{d ∈ D : t ∈ d}| ).
The numerator N is the number of sentences, which is two. The number of sentences in
which this appears is also two, giving
idf(this, D) = log(2/2) = 0.
So the tf-idf value is zero for this term, and with the basic definition this is true of
any term that occurs in all sentences.
Now consider the term example from sentence2, which occurs three times but in only one
sentence, namely sentence2. For this term,
tf(example, sentence2) = 3,
idf(example, D) = log(2/1) ≈ 0.3010,
tf-idf(example, sentence2) = tf(example, sentence2) × idf(example, D) = 3 × 0.3010 ≈ 0.9030.
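A short numerical check of this worked example is given below, using the basic raw-count
and unsmoothed logarithmic definitions (base 10) rather than a library's smoothed variant.

# Recompute the tf-idf values of the worked example.
import math

sentences = [{"this": 1, "is": 1, "a": 2, "sample": 1},
             {"this": 1, "is": 1, "another": 2, "example": 3}]
N = len(sentences)

def tf_idf(term, sentence_counts):
    tf = sentence_counts.get(term, 0)                # raw term frequency
    df = sum(1 for s in sentences if term in s)      # number of sentences containing the term
    return tf * (math.log10(N / df) if df else 0.0)

print(tf_idf("this", sentences[0]))       # 0.0, since "this" occurs in every sentence
print(tf_idf("example", sentences[1]))    # 3 * log10(2) = 0.9030...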
Chapter 3
Machine Learning Algorithms
Used For Analysis Of Business
Event Recognition
This chapter discusses the set of machine learning algorithms that were imple-
mented as part of our work. The semi-supervised approach with naive Bayes
expectation-maximization and active learning with QBC are used to increase the amount
of labeled data. The gradient boosting classifier, ada boost classifier, random forest
classifier, multilayered feed forward network and convolutional neural network are used
to classify the business event data. The following sections give a detailed understanding
of these algorithms.
3.1 Semi-supervised Learning using Naive Bayes
Classifier with Expectation-Maximization Al-
gorithm
In this approach, a naive Bayes classifier is first built in the standard supervised
fashion from the limited amount of labeled training data, and the unlabeled data are
classified with this naive Bayes model, noting the probabilities associated with each
class. Then a new naive Bayes classifier is rebuilt using all the labeled data together
with the unlabeled data, using the estimated class probabilities as if they were the true
class labels. This process of classifying the unlabeled data and rebuilding the naive
Bayes model is iterated until it converges to a stable classifier, and the corresponding
set of labels for the unlabeled data is obtained. The algorithm is summarized below as
in (Kamal Nigam et al., 2006).
1. Inputs: collections X_l of labeled sentences and X_u of unlabeled sentences.

2. Build an initial naive Bayes classifier K* from the labeled sentences X_l only.

3. Loop while the classifier parameters improve, as measured by the change in
   l(K|X, Y) (the log probability of the labeled and unlabeled data and the prior):

   (a) (E-step) Use the current classifier K* to estimate the component membership
       of each unlabeled sentence, i.e. the probability that each mixture component
       (and class) generated each sentence, P(Y = c_j | X = x_i; K*), where X and Y
       are random variables, c_j is the output of the j-th class and x_i is the
       i-th input data point.

   (b) (M-step) Re-estimate the classifier K*, given the estimated component
       membership of each sentence, using maximum a posteriori parameter estimation
       to find K* = arg max_K P(X, Y | K) P(K).

4. Output: the classifier K*, which takes an unlabeled sentence and predicts a
   class label.
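A compact sketch of this procedure, using a hard-label self-training approximation of
the E/M steps on top of scikit-learn's MultinomialNB, is given below; Xl, yl and Xu
denote the vectorized labeled data, its labels and the vectorized unlabeled data,
assumed to be dense NumPy arrays.

# Semi-supervised naive Bayes via an EM-style self-training loop (sketch).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def nb_em(Xl, yl, Xu, max_iter=20, tol=1e-4):
    clf = MultinomialNB().fit(Xl, yl)             # initial classifier from labeled data only
    prev = None
    for _ in range(max_iter):
        proba = clf.predict_proba(Xu)             # E-step: class probabilities for unlabeled data
        yu = clf.classes_[proba.argmax(axis=1)]   # hard labels approximating the soft memberships
        # M-step: rebuild the classifier on labeled plus (pseudo-)labeled data.
        clf = MultinomialNB().fit(np.vstack([Xl, Xu]), np.concatenate([yl, yu]))
        if prev is not None and np.abs(proba - prev).max() < tol:
            break                                 # estimates have stabilized
        prev = proba
    return clf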
3.2 Active Learning using Ensemble classifiers
with QBC approach
The ensemble classifiers used for QBC approach are gradient boosting classifier,
ada boosting classifier and random forest classifier. Described below is this ap-
proach in brief.
3.2.1 Query by committee
In this approach an ensemble of hypotheses is learned and examples that cause
maximum disagreement amongst this committee (with respect to the predicted
categorization) are selected as the most informative examples from a pool of unla-
beled examples. QBC iteratively selects examples to be labeled for training; in
each iteration, a committee of classifiers built on the current training set predicts labels.
Then it evaluates the potential utility of each example in the unlabeled set, and
selects a subset of examples with the highest expected utility. The labels for these
examples are acquired and they are transferred to the training set. Typically,
the utility of an example is determined by some measure of disagreement in the
committee about its predicted label. This process is repeated until the number of
available requests for labels is exhausted.
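A sketch of one selection round, using vote entropy as the disagreement measure, is shown
below; the committee members are assumed to be already fitted classifiers and X_unlabeled
the vectorized unlabeled pool.

# Query-by-committee: pick the unlabeled points with the highest vote entropy.
import numpy as np

def qbc_select(committee, X_unlabeled, n_queries=10):
    votes = np.array([clf.predict(X_unlabeled) for clf in committee])   # (members, samples)
    disagreement = []
    for j in range(votes.shape[1]):
        _, counts = np.unique(votes[:, j], return_counts=True)
        p = counts / counts.sum()
        disagreement.append(-(p * np.log(p)).sum())                     # vote entropy of sample j
    return np.argsort(disagreement)[::-1][:n_queries]                   # most-disputed examples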
3.3 Ensemble Models for Classification of Busi-
ness Events using Bag-Of-Words Approach
The series of classifiers trained on the dataset included SVM, decision tree, random
forest, ada boost, gradient boosting and SGD classifiers. Among these, the boosting
classifiers and the random forest classifier performed better than the others. We
therefore used three ensemble classifiers with decision trees as the base learner,
namely the gradient boosting classifier, the ada boost classifier and the random forest
classifier. In the end, classification of the business event datasets was done by
majority voting of these classifiers, as sketched below. The description and
mathematical formulation of each ensemble classifier is given afterwards.
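A minimal sketch of this majority-voting step with scikit-learn's VotingClassifier is
shown below; X_train, y_train and X_test denote the vectorized bag-of-words data and are
assumptions of the sketch.

# Hard majority voting over the three ensemble classifiers.
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)

voting = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(n_estimators=100)),
                ("ada", AdaBoostClassifier(n_estimators=100)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="hard")                       # majority vote of the predicted labels

# voting.fit(X_train, y_train)
# predictions = voting.predict(X_test)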
3.3.1 Gradient Boosting Classifier
Boosting algorithms are a set of machine learning algorithms which build a strong
classifier from a set of weak classifiers, typically decision trees. Gradient boosting is
one such algorithm; it builds the model in a stage-wise fashion and generalizes it by
allowing the optimization of an arbitrary differentiable loss function.
The differentiable loss function in our case is the binomial deviance loss.
The algorithm is implemented as follows, as described in (Friedman et al., 2001).
Input: a training set (X_i, y_i), i = 1, ..., n, with X_i ∈ H ⊆ R^n and y_i ∈ {−1, 1};
a differentiable loss function L(y, F(X)), which in our case is the binomial deviance
loss defined as log(1 + exp(−2yF(X))); and M, the number of iterations.

1. Initialize the model with a constant value:
   F_0(X) = arg min_γ Σ_{i=1}^{n} L(y_i, γ).

2. For m = 1 to M:

   (a) Compute the pseudo-responses:
       r_{im} = −[∂L(y_i, F(X_i)) / ∂F(X_i)] evaluated at F(X) = F_{m−1}(X),
       for i = 1, ..., n.

   (b) Fit a base learner h_m(X) to the pseudo-responses, i.e. train it on the
       set {(X_i, r_{im})}, i = 1, ..., n.

   (c) Compute the multiplier γ_m by solving the optimization problem:
       γ_m = arg min_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(X_i) + γ h_m(X_i)).

   (d) Update the model: F_m(X) = F_{m−1}(X) + γ_m h_m(X).

3. Output F_M(X) = Σ_{m=1}^{M} γ_m h_m(X).

The value of the weight γ_m is found by an approximate Newton-Raphson solution, given as
γ_m = Σ_{X_i ∈ h_m} r_{im} / Σ_{X_i ∈ h_m} |r_{im}| (2 − |r_{im}|).
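In practice this algorithm corresponds to scikit-learn's GradientBoostingClassifier,
whose default loss is the binomial deviance; a sketch with illustrative parameter values
follows (X_train, y_train and X_test are assumed).

# Gradient boosting with decision-tree base learners (sketch).
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=100,     # M, the number of boosting iterations
    learning_rate=0.1,
    max_depth=3)          # depth of each decision-tree base learner
# The default loss is the binomial deviance (log-loss) used in our work.
# gb.fit(X_train, y_train); gb.predict(X_test)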
3.3.2 AdaBoost Classifier
In adaBoost we assign (non-negative) weights to points in the data set which
are normalized, so that it forms a distribution. In each iteration, we generate a
training set by sampling from the data using the weights, i.e. the data point (Xi, yi)
would be chosen with probability wi, where wi is the current weight for that data
point. We generate the training set by such repeated independent sampling. After
learning the current classifier, we increase the (relative) weights of data points that
are misclassified by the current classifier. We generate a fresh training set using the
modified weights and so on. The final classifier is essentially a weighted majority
voting by all the classifiers. The description of the algorithm as in (Freund et al.,
1995) is given below:
Input: n examples (X_1, y_1), ..., (X_n, y_n), with X_i ∈ H ⊆ R^n and y_i ∈ {−1, 1}.

1. Initialize: w_i(1) = 1/n for all i; each data point starts with equal weight, so
   when data points are sampled from this probability distribution, each is equally
   likely to enter the training set.

2. Assume there are M classifiers within the ensemble.
   For m = 1 to M do

   (a) Generate a training set by sampling with the weights w_i(m).

   (b) Learn classifier h_m using this training set.

   (c) Let ξ_m = Σ_{i=1}^{n} w_i(m) I[y_i ≠ h_m(X_i)], where I_A is the indicator
       function of the event A, defined as
       I_A = 1 if y_i ≠ h_m(X_i) and I_A = 0 otherwise,
       so ξ_m is the weighted error of the m-th classifier.

   (d) Set α_m = log((1 − ξ_m) / ξ_m), the hypothesis weight; α_m > 0 because of the
       assumption that ξ_m < 0.5.

   (e) Update the weight distribution over the training set as
       w_i(m + 1) = w_i(m) exp(α_m I[y_i ≠ h_m(X_i)]),
       followed by normalization so that w_i(m + 1) forms a distribution:
       w_i(m + 1) = w_i(m + 1) / Σ_i w_i(m + 1).

   end for

3. Output: the final vote h(X) = sgn(Σ_{m=1}^{M} α_m h_m(X)), the sign of the weighted
   sum of all classifiers in the ensemble.

In the AdaBoost algorithm, M is a parameter. Because of the sampling with weights, the
procedure can be continued for an arbitrary number of iterations. The loss function used
in AdaBoost is the exponential loss, defined for a particular data point as exp(−y_i f(X_i)).
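A corresponding sketch with scikit-learn's AdaBoostClassifier and decision-tree stumps as
base learners is shown below; the parameter values are illustrative and X_train, y_train
and X_test are assumed.

# AdaBoost with decision-tree base learners (sketch).
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # weak learner ("base_estimator" in older releases)
    n_estimators=100,                               # M, the number of boosting rounds
    learning_rate=1.0)
# ada.fit(X_train, y_train); ada.predict(X_test)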
3.3.3 Random Forest Classifiers
Random forests are a combination of tree predictors, such that each tree depends
on the values of a random vector sampled independently, and with the same dis-
tribution for all trees in the forest. The main difference between standard decision
trees and random forest is, in decision trees, each node is split using the best split
among all variables and in random forest, each node is split using the best among
a subset of predictors randomly chosen at that node. In random forest classifier
ntree bootstrap samples are drawn from the original data, and for each obtained
bootstrap sample grow an unpruned classification decision tree, with the following
modification: at each node, rather than choosing the best split among all predic-
tors, randomly sample mtry of the predictors and choose the best split from among
those variables. Predict new data by aggregating the predictions of the ntree trees
(i.e., majority votes for classification). The algorithm is described as follows, as
in (Breiman, 2001):
Input: n examples (X_1, y_1), ..., (X_n, y_n) = D, with X_i ∈ R^n, where D is the whole
dataset.
For i = 1, ..., B:
1. Choose a bootstrap sample D_i from D.
2. Construct a decision tree T_i from the bootstrap sample D_i such that at each node a
   random subset of m features is chosen and only splits on those features are considered.
Finally, given the test data X_t, take the majority vote of the trees for classification.
Here B is the number of bootstrap data sets generated from the original data set D.
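A sketch of the corresponding scikit-learn estimator is given below; n_estimators plays
the role of B and max_features the role of m, the random feature subset considered at
each node (X_train, y_train and X_test are assumed).

# Random forest of bagged decision trees (sketch).
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,      # B, the number of bootstrap samples / trees
    max_features="sqrt",   # m, the features examined at each split
    bootstrap=True)
# rf.fit(X_train, y_train); rf.predict(X_test)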
3.4 Multilayer Feed Forward with Back Propa-
gation using word embedding approach
In this approach a word-embedding framework was used to convert words to vectors,
followed by applying an MFN to classify the business event dataset. The Gensim
module in Python was used to build the word embeddings, training the words with the
CBOW (continuous bag-of-words) or skip-gram model of the unsupervised neural language
model (Tomas Mikolov et al., 2013), where each word is assigned a uniformly distributed
(U[-1,1]) 100- to 300-dimensional vector. Once vectors have been initialized for each
word using the word embedding, a window-based approach is used to convert the word
vectors into a single global sentence vector. The obtained global sentence vector is fed
into an MFN with back-propagation for classification of the sentences using a soft-max
classifier. The algorithm is implemented as follows (a code sketch is given after the steps):
1. Initialization of each word in a sentence with a uniformly distributed (U[-
1,1]) dense vector of 100 to 300 dimension.
2. From a given set of words within a sentence, we concatenate word-embedding
vectors to form an matrix for that particular sentence.
3. Choosing an appropriate window size on the obtained matrix and corre-
spondingly applying max-pooling approach based on the window size we
finally obtain a global sentence vector.
4. The obtained global sentence vectors are fed into multilayer feed forward
network with back propagation using soft-max as the loss function. For
regularization of the multilayer feed forward network and to avoid overfitting
of the data, dropout mechanism is adopted.
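A sketch of this pipeline is given below. It trains word2vec vectors with Gensim and, as
a simple stand-in for the window/max-pooling step, averages the word vectors into a
sentence vector before feeding it to a feed-forward network; the variable names
(sentences, labels) are assumptions of the sketch.

# Word embeddings (CBOW) + multilayer feed-forward classification (sketch).
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier

tokenized = [s.split() for s in sentences]                   # sentences: list of raw strings
w2v = Word2Vec(tokenized, vector_size=100, window=5, sg=0)   # sg=0 selects the CBOW model
                                                             # (parameter "size" in older Gensim)
def sentence_vector(tokens, model, dim=100):
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X = np.array([sentence_vector(t, w2v) for t in tokenized])
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)  # trained with back-propagation
# mlp.fit(X, labels); mlp.predict(X)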
3.5 Convolutional Neural Networks for Sentence
Classification with unsupervised feature vec-
tor learning
In this model a simple CNN is trained with one layer of convolution on top of word
vectors obtained from an unsupervised neural language model (Yoon Kim, 2014). These
vectors were trained by (Mikolov et al., 2013) on 100 billion words of Google news and
are publicly available. Figure (3.1) describes the architecture of the CNN for sentence
modeling.

Figure 3.1: The image describes the architecture of the Convolutional Neural Network
with sentence modelling for the multichannel architecture
Let N be the number of sentences in the vocabulary and n the number of words in a
particular sentence, and let x_i ∈ R^k be the k-dimensional word vector corresponding
to the i-th word in the sentence. A sentence of length n (padded where necessary) is
represented as
x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n,
where ⊕ is the concatenation operator. In general, let x_{i:i+j} refer to the
concatenation of words x_i, x_{i+1}, ..., x_{i+j}. The filter weight matrix w is
initialized as a random uniformly distributed matrix of size R^{h×k}. A convolution
operation involves the filter w, which is applied to a window of h words of a particular
sentence to produce a new feature. For example, a feature c_i is generated from a window
of words x_{i:i+h−1} by
c_i = f(w · x_{i:i+h−1} + b).
Here b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent.
This filter is applied to each possible window of words in the sentence
{x_{1:h}, x_{2:h+1}, ..., x_{n−h+1:n}} to produce a feature map
c = [c_1, c_2, ..., c_{n−h+1}],
with c ∈ R^{n−h+1}. We then apply a max-pooling operation over the feature map and take
the maximum value c* = max{c} as the feature corresponding to this particular filter.
The idea is to capture the most important feature, the one with the highest value, for
each feature map. This pooling scheme naturally deals with variable sentence lengths.
We have described the process by which one feature is extracted from one filter.
extracted from one filter. The model uses multiple filters (with varying window
sizes) to obtain multiple features. These features are also called as unsupervised
features, because they are obtained by applications of different filters with variable
window sizes randomly. These features form the penultimate layer and are passed
to a fully connected soft-max layer whose output is the probability distribution
over labels.
To avoid overfitting of CNN models, drop-out mechanism is adopted.
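A sketch of such a single-convolution-layer sentence CNN, written with Keras, is shown
below; the vocabulary size, dimensions and filter settings are illustrative, and for the
static variant the embedding layer would be initialized with the pre-trained word2vec
weights.

# One-layer CNN for sentence classification with max-over-time pooling (sketch).
from tensorflow.keras import layers, models

vocab_size, embed_dim = 20000, 300                                    # assumed values

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),                          # word vectors x_i
    layers.Conv1D(filters=100, kernel_size=3, activation="tanh"),     # windows of h = 3 words
    layers.GlobalMaxPooling1D(),                                      # max-pooling: c* = max{c}
    layers.Dropout(0.5),                                              # drop-out regularization
    layers.Dense(2, activation="softmax"),                            # soft-max output layer
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train_ids, y_train, epochs=5, batch_size=32)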
3.5.1 Variations in CNN sentence models
CNN-rand: Our baseline model where all words are randomly initialized and
then modified during training.
CNN-static: A model with pre-trained vectors from word2vec. All words in-
cluding the unknown ones that are randomly initialized are kept static and only
the other parameters of the model are learned. Initializing word vectors with those
obtained from an unsupervised neural language model is a popular method to im-
prove performance in the absence of a large supervised training set. We use the
publicly available word2vec vectors that were trained on 100 billion words from
Google news. The vectors have dimensionality of 300 and were trained using the
continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in
the set of pre-trained words are initialized randomly.
Chapter 4
Results and Discussions
In this chapter we discuss the results obtained from the machine learning
algorithms that were applied in our work:
1. Semi-supervised learning approach using naive Bayes with expectation-maximization
and active learning with QBC to increase the number of labeled data points.
2. The ensemble classifiers, MFN and CNN models to classify the obtained
business data.
Described below are the results and analysis of the algorithms.
4.1 Semi-supervised Learning Implementation us-
ing Naive Bayes with Expectation Maximiza-
tion
Initially we had only a few data points that were labeled in a supervised manner. To
formulate and solve the problem as a business event classification problem, our primary
objective was to increase the number of labeled data points.
In accordance with the algorithm of semi-supervised learning using the naive Bayes
classifier with expectation maximization explained in section 3.1, the following are the
results in the three domains of acquisition, vendor-supplier and job events, with the
training data taken as 30%, 40% and 50% of the whole dataset and the rest of the pool
as test data.
4.1.1 Results and Analysis of Vendor-Supplier Event Data
The number of vendor-supplier data points labeled in a supervised manner was 754. Stated
below are some of the observations made on a large pool of unlabeled test data, obtained
by varying the split between training and test data. Table (4.1) and figure (4.1) show
the variation in accuracies and F-scores for 30%, 40% and 50% of the data used for
training, with the corresponding remaining part used as test data. Figures (4.2), (4.4)
and (4.6) display the confusion matrices for these splits; the confusion matrix gives
insight into the number of true-positives, true-negatives, false-positives and
false-negatives. Figures (4.3), (4.5) and (4.7) display the corresponding ROC curves.
Analysis: We observe an increase in accuracy and F-score as the number of training data
points increases, which is as expected. The increase in accuracy is larger than the
increase in F-score because true negatives outnumber true positives. The confusion
matrix plots show only slight variation in the numbers of true positives and true
negatives as the training set grows. The ROC curves show an increase in TPR and in the
area under the curve as the number of training data points increases.
Table 4.1: Variation in accuracies and F-scores in semi-supervised learning using
naive Bayes for vendor-supplier data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.5597     0.5915    testing data = 527, training data = 227
40                  0.7434     0.65      testing data = 454, training data = 300
50                  0.7765     0.674     testing data = 376, training data = 376
Figure 4.1: Variations in Accuracies and F1-scores for Vendor-supplier data
using Naive-Bayes, semi-supervised technique
Figure 4.2: Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for VNSP
Figure 4.3: Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for VNSP
Figure 4.4: Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for VNSP
Figure 4.5: Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for VNSP
Figure 4.6: Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for VNSP
Figure 4.7: Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for VNSP
4.1.2 Results and Analysis for Job Event Data
The number of job event data points labeled in a supervised manner was 2810. Stated
below are some of the observations made on a large pool of unlabeled test data, obtained
by varying the split between training and test data. Table (4.2) and figure (4.8) show
the variation in accuracies and F-scores for 30%, 40% and 50% of the data used for
training, with the corresponding remaining part used as test data. Figures (4.9), (4.11)
and (4.13) display the confusion matrices for these splits; the confusion matrix gives
insight into the number of true-positives, true-negatives, false-positives and
false-negatives. Figures (4.10), (4.12) and (4.14) display the corresponding ROC curves.
Analysis: As the number of training data points increases, we observe an increase in
accuracy and F-score. However, there is a vast difference between the accuracy and
F-score values, because the number of true negatives is very high compared to the number
of true positives, which is clearly visible in the confusion matrix plots. The ROC
curves show an increase in TPR and in the area under the curve as the number of training
data points increases.
Table 4.2: Variation in accuracies and F-scores in semi-supervised learning using
naive Bayes for Job event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.7483     0.4444    testing data = 1967, training data = 842
40                  0.7544     0.4863    testing data = 1686, training data = 1123
50                  0.8014     0.52      testing data = 1405, training data = 1404
Figure 4.8: Variations in Accuracies and F1-scores for Job event data using
Naive-Bayes, semi-supervised technique
Figure 4.9: Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for JOB
Figure 4.10: Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for JOB
Figure 4.11: Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for JOB
Figure 4.12: Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for JOB
Figure 4.13: Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for JOB
Figure 4.14: Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for JOB
4.1.3 Result and Analysis for Acquisition Event Data
The number of acquisition event data points labeled in a supervised manner was 1380.
Stated below are some of the observations made on a large pool of unlabeled test data,
obtained by varying the split between training and test data. Table (4.3) and figure
(4.15) show the variation in accuracies and F-scores for 30%, 40% and 50% of the data
used for training, with the corresponding remaining part used as test data. Figures
(4.16), (4.18) and (4.20) display the confusion matrices for these splits; the confusion
matrix gives insight into the number of true-positives, true-negatives, false-positives
and false-negatives. Figures (4.17), (4.19) and (4.21) display the corresponding ROC curves.
Analysis: Accuracy and F-score increase with the number of training data points. The
increase in F-score is slightly larger than the increase in accuracy. Because true
positives outnumber true negatives, the classifier is biased towards the positive class,
so the number of false positives is higher in this scenario, which is clearly visible
from the confusion matrix plots. The ROC curves show an increase in TPR and in the area
under the curve as the number of training data points increases.
Table 4.3: Variation in accuracies and F-scores in semi-supervised learning using
naive Bayes for Acquisition event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.7929     0.8178    testing data = 966, training data = 413
40                  0.7989     0.82      testing data = 828, training data = 521
50                  0.8057     0.8241    testing data = 689, training data = 690
Figure 4.15: Variations in Accuracies and F1-scores for Acquisition event data
using Naive-Bayes, semi-supervised technique
Figure 4.16: Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Acquisition
Figure 4.17: Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Acquisition
Figure 4.18: Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Acquisition
Figure 4.19: Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Acquisition
Figure 4.20: Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Acquisition
Figure 4.21: Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Acquisition
4.2 Active Learning implementation by Query
by committee approach
In accordance with the active learning algorithm explained in section 3.2, the following
are some of the results in the three domains of acquisition, vendor-supplier and job
events, with the training data taken as 30%, 40% and 50% of the whole dataset and
prediction on the test data obtained by majority voting of the three ensemble
classifiers, namely the gradient boosting classifier, ada boost classifier and random
forest classifier (i.e. the query by committee approach).
4.2.1 Results and Analysis for Vendor-Supplier Event Data
The number of vendor-supplier data points labeled in a supervised manner was 754.
Following are some of the observations made on a large pool of unlabeled test data,
obtained by varying the split between training and test data. Table (4.4) and figure
(4.22) show the variation in accuracies and F-scores for 30%, 40% and 50% of the data
used for training, with the corresponding remaining part used as test data. Figures
(4.23), (4.25) and (4.27) display the confusion matrices for these splits; the confusion
matrix gives insight into the number of true-positives, true-negatives, false-positives
and false-negatives. Figures (4.24), (4.26) and (4.28) display the corresponding ROC curves.
Analysis: We observe an increase in accuracy and F-score as the number of training data
points increases. The increase in accuracy is larger than the increase in F-score because
true negatives outnumber true positives. This method performs better than the
semi-supervised naive Bayes classifier. The confusion matrix plots show only slight
variation in the numbers of true positives and true negatives as the training set grows.
The ROC curves show an increase in TPR and in the area under the curve as the number of
training data points increases.
Table 4.4: Variation in accuracies and F-scores using Active Learning (QBC approach)
for Vendor-supplier event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.842      0.7348    testing data = 529, training data = 225
40                  0.84       0.7352    testing data = 454, training data = 300
50                  0.8643     0.76      testing data = 376, training data = 376
Figure 4.22: variations in Accuracies and F1-scores for Vendor-supplier data
using Active learning
Figure 4.23: Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Vendor-supplier
Figure 4.24: Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Vendor-supplier
Figure 4.25: Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Vendor-Supplier
Figure 4.26: Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Vendor-supplier
Figure 4.27: Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Vendor-supplier
Figure 4.28: Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Vendor-supplier
4.2.2 Result and Analysis for Job Event Data
The number of job event data points labeled in a supervised manner was 2809. Following
are some of the observations made on a large pool of unlabeled test data, obtained by
varying the split between training and test data. Table (4.5) and figure (4.29) show the
variation in accuracies and F-scores for 30%, 40% and 50% of the data used for training,
with the corresponding remaining part used as test data. Figures (4.30), (4.32) and
(4.34) display the confusion matrices for these splits; the confusion matrix gives
insight into the number of true-positives, true-negatives, false-positives and
false-negatives. Figures (4.31), (4.33) and (4.35) display the corresponding ROC curves.
Analysis: As the number of training data points increases, we observe an increase in
accuracy and F-score. However, there is a vast difference between the accuracy and
F-score values, because the number of true negatives is very high compared to the number
of true positives, which is clearly visible in the confusion matrix plots. The ROC
curves show an increase in TPR and in the area under the curve as the number of training
data points increases. The performance of this method is better than that of the
semi-supervised naive Bayes classifier, as is clearly visible from our results.
Table 4.5: Variation in accuracies and F-scores using Active Learning (QBC approach)
for Job event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.9054     0.6204    testing data = 1967, training data = 842
40                  0.9116     0.6558    testing data = 1686, training data = 1123
50                  0.9216     0.6758    testing data = 1405, training data = 1404
Figure 4.29: Variations in Accuracies and F1-scores for Job event data using
Active learning
Figure 4.30: Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Job
Figure 4.31: Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Job
Figure 4.32: Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Job
Figure 4.33: Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Job
Figure 4.34: Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Job
Figure 4.35: Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Job
4.2.3 Result and Analysis for Acquisition Event Data
The number of acquisition event data points labeled in a supervised manner was 1380.
Following are some of the observations made on a large pool of unlabeled test data,
obtained by varying the split between training and test data. Table (4.6) and figure
(4.36) show the variation in accuracies and F-scores for 30%, 40% and 50% of the data
used for training, with the corresponding remaining part used as test data. Figures
(4.37), (4.39) and (4.41) display the confusion matrices for these splits; the confusion
matrix gives insight into the number of true-positives, true-negatives, false-positives
and false-negatives. Figures (4.38), (4.40) and (4.42) display the corresponding ROC curves.
Analysis: Accuracy and F-score increase with the number of training data points, and the
increases in F-score are comparable to the increases in accuracy. The confusion matrix
plots show that the numbers of true positives and true negatives are nearly equal. The
ROC curves show an increase in TPR and in the area under the curve as the number of
training data points increases. This method shows a slight improvement in accuracy
compared to the semi-supervised naive Bayes classifier.
Table 4.6: Variation in accuracies and F-scores using Active Learning (QBC approach)
for Acquisition event data

Training data (%)   Accuracy   F-score   Dataset description
30                  0.7855     0.7549    testing data = 966, training data = 413
40                  0.812      0.7867    testing data = 828, training data = 521
50                  0.82       0.7995    testing data = 689, training data = 690
Figure 4.36: Variations in Accuracies and F1-scores for Acquisition event data
using Active learning
Figure 4.37: Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Acquisition
Figure 4.38: Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Acquisition
Figure 4.39: Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Acquisition
Figure 4.40: Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Acquisition
Figure 4.41: Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Acquisition
Figure 4.42: Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Acquisition
4.3 Comparison of Semi-supervised techniques
and Active learning approach
Active learning using the query by committee approach, with a committee of three
ensemble classifiers (i.e. the gradient boosting, ada boost and random forest
classifiers), performs better than the semi-supervised naive Bayes probabilistic model
in terms of accuracy, for all three domains of acquisition, vendor-supplier and job
events, and for each of the training-data variations of 30%, 40% and 50%. Active
learning using the query by committee approach was therefore the method implemented in
our work to increase the number of labeled data points from the pool of unlabeled data.
4.4 Results of Ensemble Classifiers with differ-
ent Parameter tuning
The parameter to be tuned was the number of base learners used within each ensemble
classifier; the base learners in our case were decision trees. Following are some
observations with 100 and 500 base learners within the ensemble, which were varied as
part of the parameter tuning. The ensemble classifiers used were the gradient boosting,
ada boost and random forest classifiers. Finally, majority voting of these three
ensemble classifiers was performed to predict the test data in order to increase the
test accuracy.
4.4.1 Analysis for vendor-supplier event Data using 100
estimators within the ensemble as the parameter
The number of data points obtained after the implementation of the active learning
approach was around 4500. Following are the observations obtained with the three
ensemble classifiers using the bag-of-words approach, with 80% of the data used for
training and 20% for testing. Five-fold cross validation was used for tuning the
parameter. The figure (4.43) and the tables (4.7), (4.8) and (4.9) display the variation
in the training score over the five folds, i.e. the accuracies and F-scores, for the
gradient boosting classifier, ada boosting classifier and random forest classifier.
Table 4.7: Variation in accuracies and F-scores for the Gradient Boosting classifier
with 100 estimators on the vendor-supplier data set
Gradient Boosting classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.9084 0.8248
2 0.8936 0.8119
3 0.8921 0.8033
4 0.8921 0.7906
5 0.90384 0.8181
Table 4.8: Variation in accuracies and F-scores for the Ada Boosting classifier
with 100 estimators on the vendor-supplier data set
Ada-boost classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.8862 0.7979
2 0.8833 0.7937
3 0.8788 0.7743
4 0.8788 0.7783
5 0.8994 0.8055
Table 4.9: Variation in accuracies and F-scores for the Random forest classifier
with 100 estimators on the vendor-supplier data set
Random forest classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.9143 0.8309
2 0.9054 0.8192
3 0.8995 0.8045
4 0.9158 0.8209
5 0.9082 0.8152
Figure 4.43: variations in Accuracies and F1-scores for Vendor-supplier data
for 5-fold using 3 ensemble classifiers
The following table (4.10) describes the test score for vendor-supplier data with
100 estimators, obtained by majority voting of three ensemble classifiers gradient
boosting, ada boost and random forest classifiers. The figure (4.44) shows the
confusion matrix display for the test data, which gives the understanding regard-
ing the true-positives, true-negatives, false-positives and false-negatives. The figure
(4.45) displays the ROC curve.
Table 4.10: Test scores for accuracy and F-score with Vendor-supplier data using voting
of three ensemble classifiers with number of estimators as 100

Area under ROC   Accuracy   F-score    Confusion matrix values
87%              90.9%      83.511%    true positives = 195, false positives = 16,
                                       true negatives = 575, false negatives = 61
Figure 4.44: Confusion matrix for Vendor-supplier with number of estimators
as 100
Figure 4.45: Roc curve for Vendor-supplier with number of estimators as 100
4.4.2 Analysis for Job event Data using 100 estimators
within the ensemble as the parameter
The number of data points obtained after the implementation of the active learning
approach was around 4500. Following are the observations obtained with the three
ensemble classifiers using the bag-of-words approach, with 80% of the data used for
training and 20% for testing. Five-fold cross validation was used for tuning the
parameter. The figure (4.46) and the tables (4.11), (4.12) and (4.13) display the
variation in the training score over the five folds, i.e. the accuracies and F-scores,
for the gradient boosting classifier, ada boosting classifier and random forest classifier.
Table 4.11: Variation in accuracies and F-scores for the Gradient Boosting classifier
with 100 estimators on the Job data set
Gradient Boosting classifier for Job
5-folds Accuracy F-scores
1 0.8670 0.7593
2 0.8961 0.7710
3 0.8743 0.77
4 0.8797 0.75
5 0.8870 0.7686
Table 4.12: Variation in accuracies and F-scores for the Ada Boosting classifier
with 100 estimators on the Job data set
Ada-boost classifier for Job
5-folds Accuracy F-scores
1 0.8561 0.7748
2 0.8907 0.7744
3 0.8761 0.7671
4 0.8688 0.7592
5 0.8925 0.7958
Table 4.13: Variation in accuracies and F-scores for the Random forest classifier
with 100 estimators on the Job data set
Random forest classifier for Job
5-folds Accuracy F-scores
1 0.8888 0.7664
2 0.8816 0.7530
3 0.8943 0.7870
4 0.9052 0.7846
5 0.8998 0.7971
Figure 4.46: Variations in Accuracies and F1-scores for Job data for 5-fold
using 3 ensemble classifiers
The following table (4.14) describes the test score for job data with 100 estima-
tors, obtained by majority voting of three ensemble classifiers gradient boosting,
ada boost and random forest classifiers. The figure (4.47) shows the confusion
matrix display for the test data, which gives the understanding regarding the
true-positives, true-negatives, false-positives and false-negatives. The figure (4.48)
displays the ROC curve.
Table 4.14: Test scores for accuracy and F-score with Job data using voting of three
ensemble classifiers with number of estimators as 100

Area under ROC   Accuracy   F-score    Confusion matrix values
83%              90.3%      81.97%     true positives = 141, false positives = 15,
                                       true negatives = 484, false negatives = 47
Figure 4.47: Confusion matrix for Job with number of estimators as 100
Figure 4.48: Roc curve for Job with number of estimators as 100
4.4.3 Analysis for Acquisition event Data using 100 esti-
mators within the ensemble as the parameter
The number of data points obtained after the implementation of the active learning
approach was around 4500. Following are the observations obtained with the three
ensemble classifiers using the bag-of-words approach, with 80% of the data used for
training and 20% for testing. Five-fold cross validation was used for tuning the
parameter. The figure (4.49) and the tables (4.15), (4.16) and (4.17) display the
variation in the training score over the five folds, i.e. the accuracies and F-scores,
for the gradient boosting classifier, ada boosting classifier and random forest classifier.
Table 4.15: Variation in accuracies and F-scores for the Gradient Boosting classifier
with 100 estimators on the Acquisition data set
Gradient Boosting classifier for Acquisition
5-folds Accuracy F-scores
1 0.9173 0.8699
2 0.9230 0.8605
3 0.9301 0.8657
4 0.9031 0.8416
5 0.9230 0.8606
Table 4.16: Variation in accuracies and F-scores for the Random forest classifier
with 100 estimators on the Acquisition data set
Random forest classifier for Acquisition
5-folds Accuracy F-scores
1 0.9287 0.9026
2 0.9287 0.8803
3 0.9472 0.9226
4 0.9273 0.8939
5 0.9458 0.9037
Table 4.17: Variation in accuracies and F-scores for the Ada Boosting classifier
with 100 estimators on the Acquisition data set
Ada-boost Classifier for Acquisition
5-folds Accuracy F-scores
1 0.9330 0.8941
2 0.9245 0.8798
3 0.9430 0.8924
4 0.9245 0.8819
5 0.9344 0.8931
Figure 4.49: Variations in Accuracies and F1-scores for Acquisition data for
5-fold using 3 ensemble classifiers
The following table (4.18) describes the test score for acquisition data with 100 es-
timators, obtained by majority voting of three ensemble classifiers gradient boost-
ing, ada boost and random forest classifiers. The figure (4.50) shows the confu-
sion matrix display for the test data, which gives the understanding regarding the
true-positives, true-negatives, false-positives and false-negatives. The figure (4.51)
displays the ROC curve.
Table 4.18: Test scores for accuracy and F-score with Acquisition data using voting of
three ensemble classifiers with number of estimators as 100

Area under ROC   Accuracy   F-score    Confusion matrix values
89%              92.25%     87.022%    true positives = 228, false positives = 17,
                                       true negatives = 582, false negatives = 51
Figure 4.50: Confusion matrix for Acquisition with number of estimators as
100
Figure 4.51: Roc curve for Acquisition with number of estimators as 100
4.4.4 Analysis for Vendor-Supplier event Data using 500
estimators within the ensemble as the parameter
The number of data points obtained after the implementation of the active learning
approach was around 4500. Following are the observations obtained with the three
ensemble classifiers using the bag-of-words approach, with 80% of the data used for
training and 20% for testing. Five-fold cross validation was used for tuning the
parameter. The figure (4.52) and the tables (4.19), (4.20) and (4.21) display the
variation in the training score over the five folds, i.e. the accuracies and F-scores,
for the gradient boosting classifier, ada boosting classifier and random forest classifier.
Table 4.19: Variation in accuracies and F-scores for the Gradient Boosting classifier
with 500 estimators on the vendor-supplier data set
Gradient Boosting classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.8892 0.7761
2 0.8936 0.7906
3 0.8995 0.7975
4 0.9054 0.8083
5 0.8934 0.8091
Table 4.20: Variation in accuracies and F-scores for the Ada Boosting classifier
with 500 estimators on the vendor-supplier data set
Ada-boost classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.9010 0.8112
2 0.8862 0.7924
3 0.9054 0.8202
4 0.9054 0.8181
5 0.8846 0.8088
Table 4.21: Variation in accuracies and F-scores for the Random forest classifier
with 500 estimators on the vendor-supplier data set
Random forest classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.8966 0.7818
2 0.9054 0.8235
3 0.9098 0.8085
4 0.9054 0.8096
5 0.9038 0.8380
Figure 4.52: Variations in Accuracies and F1-scores for Vendor-supplier data
for 5-fold using 3 ensemble classifiers
The following table (4.22) describes the test score for vendor-supplier data with
500 estimators, obtained by majority voting of three ensemble classifiers gradient
boosting, ada boost and random forest classifiers. The figure (4.53) shows the
confusion matrix display for the test data, which gives the understanding regard-
ing the true-positives, true-negatives, false-positives and false-negatives and figure
(4.54) displays the ROC curve.
Table 4.22: Test scores for accuracy and F-score with Vendor-supplier data using voting
of three ensemble classifiers with number of estimators as 500

Area under ROC   Accuracy   F-score    Confusion matrix values
88%              91.97%     85.211%    true positives = 196, false positives = 16,
                                       true negatives = 583, false negatives = 52
Figure 4.53: Confusion matrix for Vendor-supplier with number of estimators
as 500
Figure 4.54: Roc curve for Vendor-supplier with number of estimators as 500
4.4.5 Analysis for Job event Data using 500 estimators
within the ensemble as the parameter
The number of data points obtained after the implementation of the active learning
approach was around 4500. Following are the observations obtained with the three
ensemble classifiers using the bag-of-words approach, with 80% of the data used for
training and 20% for testing. Five-fold cross validation was used for tuning the
parameter. The figure (4.55) and the tables (4.23), (4.24) and (4.25) display the
variation in the training score over the five folds, i.e. the accuracies and F-scores,
for the gradient boosting classifier, ada boosting classifier and random forest classifier.
Table 4.23: Variation in accuracies and F-scores for the Gradient Boosting classifier
with 500 estimators on the Job data set
Gradient Boosting classifier for Job
5-folds Accuracy F-scores
1 0.8943 0.7944
2 0.9107 0.8297
3 0.9016 0.7812
4 0.9107 0.8167
5 0.9034 0.81
Table 4.24: Variation in accuracies and F-scores for the Ada Boosting classifier
with 500 estimators on the Job data set
Ada-boost classifier for Job
5-folds Accuracy F-scores
1 0.8834 0.7894
2 0.8888 0.7859
3 0.9016 0.8029
4 0.8870 0.7769
5 0.8816 0.7653
Table 4.25: Variation in accuracies and F-scores for the Random forest classifier
with 500 estimators on the Job data set
Random forest classifier for Job
5-folds Accuracy F-scores
1 0.8761 0.7585
2 0.9016 0.7806
3 0.8888 0.7698
4 0.8888 0.7630
5 0.9016 0.7704
Figure 4.55: Variations in Accuracies and F1-scores for Job data for 5-fold
using 3 ensemble classifiers
The following table (4.26) gives the test scores for the job data with 500 estimators, obtained by majority voting of the three ensemble classifiers: gradient boosting, AdaBoost and random forest. The figure (4.56) shows the confusion matrix for the test data, which gives the counts of true positives, true negatives, false positives and false negatives, and the figure (4.57) displays the ROC curve.
Table 4.26: Variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 500
Test score for Job using voting of three ensemble classifiers with number of estimators as 500
Area under ROC Accuracy F-scores Confusion matrix values
87.56% 92.3% 83.88% true positives = 149, false positives = 16, true negatives = 486, false negatives = 36
Figure 4.56: Confusion matrix for Job with number of estimators as 500
Figure 4.57: ROC curve for Job with number of estimators as 500
4.4.6 Analysis for Acquisition event Data using 500 estimators within the ensemble as the parameter
The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training and 20% for testing; 5-fold cross validation was used for tuning the parameter. The figure (4.58) and the tables (4.27), (4.28) and (4.29) display the variation in the 5-fold training accuracies and F-scores for the gradient boosting, Ada boosting and random forest classifiers.
Table 4.27: Variation in accuracies and F-scores for Gradient Boosting classifier with the number of estimators set to 500 in the Acquisition data set
Gradient Boosting classifier for Acquisition
5-folds Accuracy F-scores
1 0.9230 0.8693
2 0.9245 0.8564
3 0.9259 0.8735
4 0.9131 0.86238
5 0.9216 0.8648
Table 4.28: Variation in accuracies and F-scores for Random forest classifier with the number of estimators set to 500 in the Acquisition data set
Random forest classifier for Acquisition
5-folds Accuracy F-scores
1 0.9458 0.9121
2 0.9373 0.8976
3 0.9444 0.9107
4 0.9188 0.8758
5 0.9415 0.9061
Table 4.29: Variation in accuracies and F-scores for Ada boosting classifier with the number of estimators set to 500 in the Acquisition data set
Ada-boost Classifier for Acquisition
5-folds Accuracy F-scores
1 0.9358 0.8872
2 0.9316 0.8953
3 0.9358 0.8878
4 0.9230 0.8717
5 0.9387 0.8997
Figure 4.58: Variations in Accuracies and F1-scores for Acquisition data for
5-fold using 3 ensemble classifiers
The following table (4.30) gives the test scores for the acquisition data with 500 estimators, obtained by majority voting of the three ensemble classifiers: gradient boosting, AdaBoost and random forest. The figure (4.59) shows the confusion matrix for the test data, which gives the counts of true positives, true negatives, false positives and false negatives, and the figure (4.60) displays the ROC curve.
Table 4.30: Variation in test score for accuracy and F-score with Acquisition
data using voting of three ensemble classifiers with number of estimators as 500
Test score for Acquisition using voting of three ensemble classifiers with number of estimators as 500
Area under ROC Accuracy F-scores Confusion matrix values
92% 94.21% 91.10% true positives = 245, false positives = 8, true negatives = 591, false negatives = 34
Figure 4.59: Confusion matrix for Acquisition with number of estimators as
500
Figure 4.60: ROC curve for Acquisition with number of estimators as 500
4.5 Final Accuracies and F-score estimates for
the model
Once the parameters were tuned, the test analysis of the whole data set was performed. In this process the model was trained on the whole data set and the overall accuracies and F-scores were checked using 5-fold cross-validation. The overall analysis is discussed in the subsequent subsections; a minimal sketch of this cross-validation step is given below.
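The following is a small, illustrative sketch (Python, scikit-learn) of how such a 5-fold whole-data-set evaluation can be run for one of the tuned classifiers. The TF-IDF matrix and label vector here are rebuilt from toy sentences made up for this sketch; the real features and labels come from the pre-processing described in the earlier chapters.

# Minimal sketch (assumed set-up, toy data): 5-fold accuracy and F-score on
# the whole data set for a tuned 500-estimator gradient boosting classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate

texts = ["Google acquired Yahoo as an organization",
         "Infosys hired a new chief executive officer",
         "IBM won a supply contract from Airtel",
         "The weather in Bangalore was pleasant today"] * 25
y = [1, 1, 1, 0] * 25                            # toy labels

X = TfidfVectorizer().fit_transform(texts)
clf = GradientBoostingClassifier(n_estimators=500)
scores = cross_validate(clf, X, y, cv=5, scoring=("accuracy", "f1"))
print(scores["test_accuracy"])                   # per-fold accuracies
print(scores["test_f1"])                         # per-fold F-scores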
4.5.1 Final Analysis of Vendor-Supplier Dataset
The vendor-supplier dataset consists of around 4500 data points. The tuning step fixed the number of estimators at 500, the value that gave the best performance under cross-validation. The following are the results obtained with this tuned parameter value on the whole data set. The figure (4.61) and the tables (4.31), (4.32) and (4.33) display the variation in the 5-fold accuracies and F-scores on the whole data set for the gradient boosting, Ada boosting and random forest classifiers.
Table 4.31: Variation in accuracies and F-scores for Gradient Boosting classifier for the whole vendor-supplier data set
Gradient Boosting classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.9067 0.83511
2 0.8912 0.7925
3 0.8947 0.8044
4 0.9172 0.8529
5 0.9219 0.8538
Table 4.32: Variation in accuracies and F-scores for Ada Boosting classifier for the whole vendor-supplier data set
Ada-boost classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.9020 0.8329
2 0.8947 0.8026
3 0.8841 0.7958
4 0.9054 0.8336
5 0.8983 0.8123
Table 4.33: Variation in accuracies and F-scores for Random forest classifier for the whole vendor-supplier data set
Random forest classifier for Vendor-supplier
5-folds Accuracy F-scores
1 0.9173 0.8395
2 0.9018 0.8019
3 0.9030 0.8137
4 0.9148 0.8509
5 0.9137 0.8212
Figure 4.61: Variations in Accuracies and F1-scores for Vendor-supplier data
for whole data set
4.5.2 Final Analysis of Job Dataset
The job dataset consists of around 4500 data points. The tuning step fixed the number of estimators at 500, the value that gave the best performance under cross-validation. The following are the results obtained with this tuned parameter value on the whole data set. The figure (4.62) and the tables (4.34), (4.35) and (4.36) display the variation in the 5-fold accuracies and F-scores on the whole data set for the gradient boosting, Ada boosting and random forest classifiers.
Table 4.34: Variation in accuracies and F-scores for Gradient Boosting classifier for the whole Job data set
Gradient Boosting classifier for Job
5-folds Accuracy F-scores
1 0.9053 0.8134
2 0.9010 0.7841
3 0.9125 0.8125
4 0.9139 0.8478
5 0.9154 0.8429
Table 4.35: Variation in accuracies and F-scores for Ada Boosting classifier for the whole Job data set
Ada-boost classifier for Job
5-folds Accuracy F-scores
1 0.9112 0.8235
2 0.9010 0.7952
3 0.8848 0.7830
4 0.9037 0.8272
5 0.9023 0.8152
Table 4.36: Variation in accuracies and F-scores for Random forest classifier for the whole Job data set
Random forest classifier for Job
5-folds Accuracy F-scores
1 0.9112 0.8051
2 0.8981 0.8
3 0.8965 0.7801
4 0.9139 0.8209
5 0.8921 0.81
Figure 4.62: Variations in Accuracies and F1-scores for Job data for whole
data set
4.5.3 Final Analysis of Acquisition Dataset
The acquisition dataset consists of around 4500 data points. The tuning step fixed the number of estimators at 500, the value that gave the best performance under cross-validation. The following are the results obtained with this tuned parameter value on the whole data set. The figure (4.63) and the tables (4.37), (4.38) and (4.39) display the variation in the 5-fold accuracies and F-scores on the whole data set for the gradient boosting, Ada boosting and random forest classifiers.
Table 4.37: Variation in accuracies and F-scores for Gradient Boosting classifier for the whole Acquisition data set
Gradient Boosting classifier for Acquisition
5-folds Accuracy F-scores
1 0.9202 0.8689
2 0.9373 0.8924
3 0.9396 0.8846
4 0.9429 0.9084
5 0.9293 0.8872
Table 4.38: Variation in accuracies and F-scores for Random forest classifier for the whole Acquisition data set
Random forest classifier for Acquisition
5-folds Accuracy F-scores
1 0.9328 0.8851
2 0.9544 0.9161
3 0.9487 0.9150
4 0.9452 0.9100
5 0.9441 0.9039
Table 4.39: Variation in accuracies and F-scores for Ada boosting classifier for the whole Acquisition data set
Ada-boost Classifier for Acquisition
5-folds Accuracy F-scores
1 0.9362 0.8992
2 0.9464 0.9101
3 0.9362 0.8943
4 0.9407 0.9057
5 0.9395 0.9016
Figure 4.63: Variations in Accuracies and F1-scores for Acquisition data for the whole data set
4.6 Results obtained for MFN with Word Embedding
The results obtained for this model were not satisfactory: the classifier predictions were not accurate. The global sentence vector was generated in code from the set of word-embedding vectors of the words in the sentence. The following is an illustration of the vendor-supplier test scores on a test data set of 225 data points; similar failure results were obtained for the job and acquisition events. A minimal sketch of the sentence-vector construction is given after the table.
Table 4.40: Variation in test score for MFN with word embedding
Test score for MFN with word embedding on vendor-supplier dataset
Accuracy F-score Confusion matrix values
0.65 0.39 true negatives = 140, true positives = 13, false positives = 3, false negatives = 69
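As a rough illustration of how such a global sentence vector can be formed, the following sketch (Python) averages the word-embedding vectors of the words in a sentence and feeds the result to a small multilayer feed-forward network. The vocabulary, the random embedding table and the two toy sentences are purely illustrative assumptions, not the thesis data or code.

# Minimal sketch (assumed logic, toy data): average the word-embedding vectors
# of a sentence into one "global" sentence vector, then classify it with an MFN.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(0)
vocab = {"google": 0, "acquired": 1, "yahoo": 2, "a": 3, "land": 4}
embeddings = rng.uniform(-1, 1, size=(len(vocab), 50))   # toy 50-dim word vectors

def sentence_vector(sentence):
    # average the embeddings of the known words; zero vector if none are known
    idx = [vocab[w] for w in sentence.lower().split() if w in vocab]
    if not idx:
        return np.zeros(embeddings.shape[1])
    return embeddings[idx].mean(axis=0)

X = np.array([sentence_vector(s) for s in ["Google acquired Yahoo",
                                           "Google acquired a land"]])
y = np.array([1, 0])
mfn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500).fit(X, y)
print(mfn.predict(X))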
4.7 Results obtained for Convolutional Neural
Networks
In the convolutional neural network analysis, two models were used: CNN-rand, in which each word is initialized with a uniformly distributed random vector U[−1, 1], and CNN-word2vec, both as described in section (3.5). Displayed below are the results and analysis for both CNN models with 3-fold cross validation on the whole data set.
4.7.1 Analysis for Vendor-Supplier Data using CNN-rand
and CNN-word2vec Model
The shape of the input matrix for vendor-supplier was 2515×300, i.e. the maximum sentence length × the dimension of the corresponding word vectors. The filter shapes used to extract features were 3×300, 4×300 and 5×300, and the hidden layer had 100×2 units. The activation function used was ReLU, the drop-out rate was 0.5 and the learning rate was 0.95. The following are the results for CNN-rand and CNN-word2vec with 3-fold cross validation. The table (4.41) and figure (4.64) show the variation in accuracies for the CNN-rand and CNN-word2vec models; a minimal sketch of this CNN architecture is given at the end of this subsection.
Table 4.41: Variation in accuracies for CNN-rand and CNN-word2vec models for Vendor-supplier on the whole data set
CNN-rand and CNN-word2vec models for Vendor-supplier on whole data set
3-folds CNN-rand accuracy CNN-word2vec accuracy
1 0.90049 0.91044
2 0.9168 0.9193
3 0.9070 0.92035
The average accuracy for CNN-rand is 0.9081 and for CNN-word2vec is 0.9167.
Figure 4.64: CNN-rand and CNN-word2vec models for Vendor-supplier on
whole data set with 3-folds
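For reference, the following is a minimal re-implementation sketch of such a CNN sentence classifier written with the Keras API; it is not the thesis code, and the vocabulary size, sequence length and toy training batch are assumptions made only to keep the example self-contained. For the CNN-word2vec variant, pre-trained word2vec weights would be loaded into the Embedding layer instead of the random initialization shown here.

# Minimal sketch (assumed re-implementation): Kim-style CNN with filter widths
# 3, 4 and 5, 100 feature maps each, ReLU activation and 0.5 drop-out.
import numpy as np
from tensorflow.keras import layers, models

vocab_size, seq_len, emb_dim = 5000, 100, 300              # illustrative sizes
inp = layers.Input(shape=(seq_len,))
emb = layers.Embedding(vocab_size, emb_dim)(inp)           # random init = CNN-rand
pooled = []
for width in (3, 4, 5):                                    # filter widths 3, 4, 5
    conv = layers.Conv1D(100, width, activation="relu")(emb)
    pooled.append(layers.GlobalMaxPooling1D()(conv))       # max-over-time pooling
hidden = layers.Dropout(0.5)(layers.Concatenate()(pooled)) # drop-out rate 0.5
out = layers.Dense(1, activation="sigmoid")(hidden)
model = models.Model(inp, out)
model.compile(optimizer="adadelta", loss="binary_crossentropy",
              metrics=["accuracy"])

X = np.random.randint(0, vocab_size, size=(32, seq_len))   # toy padded word indices
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, batch_size=8)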
4.7.2 Analysis for Acquisition Data using CNN-rand and
CNN-word2vec Model
The shape of the input matrix for acquisition was 580×300, i.e. the maximum sentence length × the dimension of the corresponding word vectors. The filter shapes used to extract features were 3×300, 4×300 and 5×300, and the hidden layer had 100×2 units. The activation function used was ReLU, the drop-out rate was 0.5 and the learning rate was 0.95. The following are the results for CNN-rand and CNN-word2vec with 3-fold cross validation. The table (4.42) and figure (4.65) show the variation in accuracies for the CNN-rand and CNN-word2vec models.
Table 4.42: Variation in accuracies for CNN-rand and CNN-word2vec models for Acquisition on the whole data set
CNN-rand and CNN-word2vec models for Acquisition on whole data set
3-folds CNN-rand accuracy CNN-word2vec accuracy
1 0.9439 0.9672
2 0.9251 0.9705
3 0.9386 0.9613
The average accuracy for CNN-rand is 0.9359 and for CNN-word2vec is 0.966.
4.7.3 Analysis for Job using CNN-rand and CNN-word2vec
Model
The shape of the input matrix for job was 1192×300, i.e. the maximum sentence length × the dimension of the corresponding word vectors. The filter shapes used to extract features were 3×300, 4×300 and 5×300, and the hidden layer had 100×2 units. The activation function used was ReLU, the drop-out rate was 0.5 and the learning rate was 0.95. The following are the results for CNN-rand and CNN-word2vec with 3-fold cross validation. The table (4.43) and figure (4.66) show the variation in accuracies for the CNN-rand and CNN-word2vec models.
Figure 4.65: CNN-rand and CNN-word2vec models for Acquisition on whole
data set with 3-folds
Table 4.43: Variation in accuracies for CNN-rand and CNN-word2vec models for Job on the whole data set
CNN-rand and CNN-word2vec models for Job on whole data set
3-folds CNN-rand accuracy CNN-word2vec accuracy
1 0.7951 0.8226
2 0.8005 0.7941
3 0.8181 0.8357
The average accuracy for CNN-rand is 0.8046 and for CNN-word2vec is 0.8108.
Figure 4.66: CNN-rand and CNN-word2vec models for Job on whole data set
with 3-folds
4.8 Result Analysis
Given below is a summary of the results and analysis:
1. Active learning using the query-by-committee approach, in which the three ensemble classifiers (gradient boosting, AdaBoost and random forest) form the committee, performs better than the semi-supervised naive Bayes probabilistic model; a minimal sketch of one query round is given after this list.
2. The accuracies and F-scores of all three ensemble classifiers were consistent and good on the respective business event datasets.
3. The CNN-word2vec models performed better than the CNN-rand models on all three business event datasets.
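The sketch below (Python, illustrative only) shows one round of the query-by-committee idea referred to in point 1: the three ensemble classifiers vote on the unlabeled pool and the points with the highest vote entropy, i.e. the most disagreement, are returned for manual labeling. The toy arrays and the query budget are assumptions made for the sketch, not the thesis data.

# Minimal sketch (assumed procedure, toy data): one query-by-committee round.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier, AdaBoostClassifier,
                              RandomForestClassifier)

def query_by_committee(X_lab, y_lab, X_unlab, n_queries=10):
    committee = [GradientBoostingClassifier(n_estimators=100),
                 AdaBoostClassifier(n_estimators=100),
                 RandomForestClassifier(n_estimators=100)]
    votes = np.array([m.fit(X_lab, y_lab).predict(X_unlab) for m in committee])
    p_pos = votes.mean(axis=0)                       # fraction of positive votes
    eps = 1e-12                                      # avoid log(0)
    entropy = -(p_pos * np.log(p_pos + eps)
                + (1 - p_pos) * np.log(1 - p_pos + eps))
    return np.argsort(entropy)[::-1][:n_queries]     # most disputed points first

rng = np.random.RandomState(0)
X_lab, y_lab = rng.rand(40, 5), rng.randint(0, 2, 40)    # toy labeled pool
X_unlab = rng.rand(200, 5)                               # toy unlabeled pool
print(query_by_committee(X_lab, y_lab, X_unlab))
# the returned points are labeled by hand, moved to the labeled pool,
# the committee is retrained, and the loop repeats.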
Chapter 5
Conclusions and Future work
Extraction of vital information from unstructured text is a hard problem. In the following sections we discuss the challenges encountered in our work on business event recognition, a summary of our work, and its future scope.
5.1 Challenges Encountered in Business Event
Recognition
The identification of business event sentences from online news articles is a tedious and difficult task. The following are the challenges encountered while carrying out our work on business event recognition.
1. Uncertainty about the amount of data to extract and the variations possible in the business event data: Ideally the data extraction should have covered all possible variations of business news describing each event, so that classification of business events could have been performed under all variations. This was only partly possible, because we wrote crawlers to extract business news data from a limited set of websites. Information extracted from this crawled text was labeled and used for training, so the models failed to capture business event sentences whose variants were not present in the training data. This led to a larger number of false negatives than false positives, which is reflected in our models.
2. Application of active learning methods was time consuming: The active learning method involved querying the most informative examples, having them labeled by the user, and then retraining the classifiers with the newly added labeled data points; this process was repeated. It was time consuming because domain expertise was required to label the queried examples.
3. Business event datasets were unstructured: Understanding the patterns and extracting useful features from the business event datasets was difficult because the data were unstructured.
4. Bag-of-words vectorizers fail to capture the exact meaning of a word: The bag-of-words approach disregards both grammar and the contextual meaning of a word within the sentence, so the analysis performed using this approach for business event recognition has drawbacks, as illustrated below (a small worked example is given after this list).
Example: the model will recognize both of the sentences below as an acquisition event, because the two meanings of "acquired" are not distinguished in a bag-of-words model.
a) Google acquired a land for developing its office.
b) Google acquired Yahoo as an organization.
Developing models that overcome this problem in the classification of business events was therefore challenging.
5. Restricted analysis on CNN models: The runtime for the CNN models with 3-fold cross-validation was around 20 hours (on a Core-i5 processor) for each of the business event datasets, so the analysis was restricted to 3-fold cross validation for each of the respective business event datasets.
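To make point 4 concrete, the short sketch below (Python, scikit-learn) builds bag-of-words vectors for the two example sentences from that point; both rows contain the token "acquired" with the same count, so a purely keyword-driven model has no basis for separating the land purchase from the corporate acquisition. The code is illustrative only.

# Minimal sketch: bag-of-words counts for the two example sentences.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Google acquired a land for developing its office",
             "Google acquired Yahoo as an organization"]
vec = CountVectorizer()
bow = vec.fit_transform(sentences).toarray()
print(sorted(vec.vocabulary_, key=vec.vocabulary_.get))   # vocabulary in column order
print(bow)                                                # both rows count "acquired" once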
5.2 Conclusions
An automated model for recognizing business events in three domains, i.e. acquisition, vendor-supplier and job, was developed, which was the main objective of our project. The developed model was able to predict business event sentences and give out additional information, such as the organizations and persons involved in the event sentence, which was our desired output.
In the bag-of-words approach, TF-IDF vectorizers performed better than count vectorizers when used with the ensemble classifiers to classify the business event data.
In the conversion of words to vectors, the word-embedding and word2vec models capture the context of a sentence to a certain extent, in comparison with the bag-of-words model, which fails because it disregards the order and grammar of words within a sentence.
From our results and analysis of the semi-supervised setting, we find that the active learning approach using ensemble learning gives better results than the naive Bayes classifier with expectation maximization in all three business event domains, i.e. acquisition, vendor-supplier and job.
For the acquisition business event, the CNN models with cross validation on the whole data set give better accuracies than the ensemble classifiers with cross validation on the whole data set; for the vendor-supplier dataset the accuracies are nearly the same in both cases, and for the job event dataset the ensemble classifiers perform better than the CNN models.
5.3 Future works
As the future scope of our project, the three major issues we can take up are the problem of co-reference resolution, a more exhaustive analysis of the CNN models, and the application of HMMs to our model. These three directions are described below.
1. The problem of co-reference resolution exists in our model; co-reference resolution is the identification of noun phrases and other terms that refer to nouns, such as her, him, it, that, their and them.
Example: ISRO acquired their organization.
Here, which organization "their" refers to is unknown; resolving this is the problem of co-reference resolution.
2. The CNN analysis was restricted to 3-fold cross-validation on the whole data set after parameter tuning with 3-fold cross validation; a further, more exhaustive analysis using 5-fold and 10-fold cross validation can be performed to improve the overall performance of the model.
3. After the classification of business event sentences, if a false positive is obtained which contains the same keyword as a true positive, such false positives can be removed by building HMM models. An illustrative example is given below.
a) Google has acquired a plot in the U.S.A.
b) Google is going to purchase a land.
These two sentences are false positives, so a probabilistic model using an HMM can be applied to such falsely classified sentences and they can be converted into true negatives.
Thesis-aligned-sc13m055

  • 1.
    BUSINESS EVENT RECOGNITIONFROM ONLINE NEWS ARTICLES A Project Report submitted by MOHAN KASHYAP.P in partial fulfillment of the requirements for the award of the degree of MASTER OF TECHNOLOGY IN MACHINE LEARNING AND COMPUTING DEPARTMENT OF MATHEMATICS INDIAN INSTITUTE OF SPACE SCIENCE AND TECHNOLOGY Thiruvananthapuram - 695547 May 2015
  • 2.
    i CERTIFICATE This is tocertify that the thesis titled ’Business Event Recognition From Online News Articles’, submitted by Mohan Kashyap.P, to the Indian Insti- tute of Space Science and Technology, Thiruvananthapuram, for the award of the degree of MASTER OF TECHNOLOGY, is a bonafide record of the research work done by him under my supervision. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma. Dr. Sumitra.S Supervisor Department of Mathematics IIST Dr. Raju K. George Head of Department Department of Mathematics IIST Place: Thiruvananthapuram May, 2015
  • 3.
    ii DECLARATION I declare thatthis thesis titled ’Business Event Recognition From Online News Articles’ submitted in fulfillment of the Degree of MASTER OF TECH- NOLOGY is a record of original work carried out by me under the supervision of Dr. Sumitra .S, and has not formed the basis for the award of any degree, diploma, associateship, fellowship or other titles in this or any other Institution or University of higher learning. In keeping with the ethical practice in reporting scientific information, due acknowledgements have been made wherever the find- ings of others have been cited. Mohan Kashyap.P SC13M055 Place: Thiruvananthapuram May, 2015
  • 4.
    iii Abstract Business Event RecognitionFrom Online News Articles deals with the ex- traction of news from text related to business events in three domains Acquisition, Vendor-supplier and Job. The developed automated model for recognizing busi- ness events would predict whether the online news article contains a business event or not. For developing the model, the data related to business events had been crawled. Since the manual labeling of data was expensive, semi-supervised learn- ing techniques were used for getting required labeled data and then tagged data had been pre-processed using techniques of natural language processing. Further on vectorizers were applied on the text to convert it into numerics using bag-of- words, word-embedding and word2vec approaches. In the end ensemble classifiers with bag-of-words approach and CNN(Convolutional Neural Network) using word- embedding, word2vec approaches were applied on the business event datasets and the results obtained were found to be promising.
  • 5.
    Acknowledgements First and foremostI thank God, The Almighty, for all his blessings. I would like to express my deepest gratitude to my research supervisor and teacher, Dr. Sumitra .S for her continuous guidance and motivation without which this research work would never have been possible. I cannot thank her enough for her limit- less patience and dedication in correcting my thesis report and molding it into its present form. Interactions with her taught me the importance of small things which are often overlooked and an exposure to the art of approaching a problem at different angles. These lessons will be invaluable for me in my career and personal life ahead. Besides my supervisor, I would like to thank my mentor, Mr.Mahesh C.R. of TataaTsu Idea Labs for allowing me to carry my thesis work in their organization.I would like to express my deepest gratitude for him for helping me to realize my abilities and build confidence in me to to solve challenging problems in Machine Learning turning my theoretical understanding into practical real time implemen- tation.My sincere thanks also goes to all the faculty members of Mathematics Department for their encouragement, questions and insightful comments. I am grateful to my project lead at Tataatsu Idea labs Mr.Vinay and his team of the Tataatsu Idea labs for helping me in implementation of project work . I would like to appreciate Research Scholar Shiju.S.Nair for extending his ’any time’ help and thanks to him for providing additional inputs to my work. last but not the least i would like to thank my classmates and friends in IIST for their company and for all the fun we had during the two years of M.Tech.Hailing from Electrical Background not that great in coding special thanks goes to Praveen and Sailesh for constantly supporting me and guiding through for two years in ma- chine learning and Arvindh too for inspiring me in certain regards of the course iv
  • 6.
    v work. Last but notthe least, I would like to thank my parents and my sister for their care, love and support throughout my life.
  • 8.
    Contents Acknowledgements iv List ofFigures vi List of Tables ix List of Abbreviations xii 1 Introduction 1 1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Natural Language Processing . . . . . . . . . . . . . . . . . 3 Information Extraction and Retrieval: . . . . . . . . . 4 Named Entity Recognition: . . . . . . . . . . . . . . . 4 Parts Of Speech Tagging: . . . . . . . . . . . . . . . . 4 1.2.2 Text to Numeric Conversion . . . . . . . . . . . . . . . . . . 4 1.2.3 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.3.1 Semi-supervised Technique . . . . . . . . . . . . . . 5 1.2.3.2 Active Learning . . . . . . . . . . . . . . . . . . . . 6 Uncertainty sampling: . . . . . . . . . . . . . . . . . . 6 Query by the committee: . . . . . . . . . . . . . . . . 6 Expected model change: . . . . . . . . . . . . . . . . 7 Expected error reduction: . . . . . . . . . . . . . . . . 7 Variance reduction: . . . . . . . . . . . . . . . . . . . 7 1.2.4 Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . 7 1.2.4.1 Ensemble Classifiers . . . . . . . . . . . . . . . . . 7 Bagging: . . . . . . . . . . . . . . . . . . . . . . . . . 8 Boosting: . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.5 Convolutional Neural Network . . . . . . . . . . . . . . . . . 8 Convolutional Layer: . . . . . . . . . . . . . . . . . . 8 Activation Function: . . . . . . . . . . . . . . . . . . 9 Pooling layer: . . . . . . . . . . . . . . . . . . . . . . 9 Fully connected layer: . . . . . . . . . . . . . . . . . . 9 Loss layer: . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.6 Measures used for Analysing the Results: . . . . . . . . . . . 9 i
  • 9.
    Contents ii 1.3 RelatedWorks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Part-of-speech (POS) pattern of the phrase: . . . . . 11 Extraction of rhetorical signal features: . . . . . . . . 11 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.1 Second Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.2 Third Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.3 Fourth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4.4 Fifth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.5 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Data Extraction,Data pre-processing and Feature Engineering 14 2.1 Crawling of Data from Web . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Labeling of Extracted Data . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1.1 Acquisition Data Description . . . . . . . . . . . . 15 Acquisition event: . . . . . . . . . . . . . . . . . . . . 15 Non Acquisition event: . . . . . . . . . . . . . . . . . 15 2.2.1.2 Vendor-Supplier Data Description . . . . . . . . . . 15 Vendor-Supplier event: . . . . . . . . . . . . . . . . . 15 Non Vendor-Supplier event: . . . . . . . . . . . . . . 16 2.2.1.3 Job Data Description . . . . . . . . . . . . . . . . . 16 Job event: . . . . . . . . . . . . . . . . . . . . . . . . 16 Non Job event: . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Type 1 Features . . . . . . . . . . . . . . . . . . . . . . . . . 17 Noun, Noun-phrases and Proper nouns: . . . . . . . . 17 Example of Noun-phrase: . . . . . . . . . . . . . . . . 17 Word-Capital: . . . . . . . . . . . . . . . . . . . . . . 17 Example of Capital words: . . . . . . . . . . . . . . . 17 Parts of speech tag pattern: . . . . . . . . . . . . . . 17 Example of POS tag pattern Adj-Noun format: . . . 18 2.3.2 Type 2 Features . . . . . . . . . . . . . . . . . . . . . . . . . 18 Organization Name: . . . . . . . . . . . . . . . . . . . 18 Example of Organization names: . . . . . . . . . . . . 18 Organization references: . . . . . . . . . . . . . . . . 18 Examples of Organization references: . . . . . . . . . 18 Location: . . . . . . . . . . . . . . . . . . . . . . . . . 18 Example of location as feature . . . . . . . . . . . . . 18 Persons: . . . . . . . . . . . . . . . . . . . . . . . . . 18 Example of Persons: . . . . . . . . . . . . . . . . . . . 18 2.3.3 Type 3 Features . . . . . . . . . . . . . . . . . . . . . . . . . 19 Continuation: . . . . . . . . . . . . . . . . . . . . . . 19 Change of direction: . . . . . . . . . . . . . . . . . . . 19
  • 10.
    Contents iii Sequence: .. . . . . . . . . . . . . . . . . . . . . . . 19 Illustration: . . . . . . . . . . . . . . . . . . . . . . . 19 Emphasis: . . . . . . . . . . . . . . . . . . . . . . . . 19 Cause, condition or result : . . . . . . . . . . . . . . . 19 Spatial signals: . . . . . . . . . . . . . . . . . . . . . 19 Comparison or contrast: . . . . . . . . . . . . . . . . 19 Conclusion: . . . . . . . . . . . . . . . . . . . . . . . 19 Fuzz: . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Description of Vectorizers . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.1 Count Vectorizers . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.1.1 Example of Count Vectorizer . . . . . . . . . . . . 20 2.4.2 Term Frequency and Inverse Document Frequency . . . . . . 21 2.4.2.1 Formulation of Term Frequency and Inverse Doc- ument Frequency . . . . . . . . . . . . . . . . . . . 21 Term-Frequency formulation: . . . . . . . . . . . . . . 21 Inverse Document Frequency formulation: . . . . . . . 21 2.4.2.2 Description of Combination of TF and IDF . . . . 22 2.4.2.3 Example of TF-IDF Vectorizer . . . . . . . . . . . 22 3 Machine Learning Algorithms Used For Analysis Of Business Event Recognition 24 3.1 Semi-supervised Learning using Naive Bayes Classifier with Expectation- Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Active Learning using Ensemble classifiers with QBC approach . . . 25 3.2.1 Query by committee . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Ensemble Models for Classification of Business Events using Bag- Of-Words Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.1 Gradient Boosting Classifier . . . . . . . . . . . . . . . . . . 26 3.3.2 AdaBoost Classifier . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.3 Random Forest Classifiers . . . . . . . . . . . . . . . . . . . 29 3.4 Multilayer Feed Forward with Back Propagation using word em- bedding approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Convolutional Neural Networks for Sentence Classification with un- supervised feature vector learning . . . . . . . . . . . . . . . . . . . 30 3.5.1 Variations in CNN sentence models . . . . . . . . . . . . . . 32 CNN-rand: . . . . . . . . . . . . . . . . . . . . . . . . 32 CNN-static: . . . . . . . . . . . . . . . . . . . . . . . 32 4 Results and Discussions 34 4.1 Semi-supervised Learning Implementation using Naive Bayes with Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.1 Results and Analysis of Vendor-Supplier Event Data . . . . 35 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.1.2 Results and Analysis for Job Event Data . . . . . . . . . . . 40 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 40
  • 11.
    Contents iv 4.1.3 Resultand Analysis for Acquisition Event Data . . . . . . . 45 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 Active Learning implementation by Query by committee approach . 50 4.2.1 Results and Analysis for Vendor-Supplier Event Data . . . . 50 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2.2 Result and Analysis for Job Event Data . . . . . . . . . . . 55 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.3 Result and Analysis for Acquisition Event Data . . . . . . . 60 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.3 Comparison of Semi-supervised techniques and Active learning ap- proach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4 Results of Ensemble Classifiers with different Parameter tuning . . 65 4.4.1 Analysis for vendor-supplier event Data using 100 estimators within the ensemble as the parameter . . . . . . . . . . . . . 65 4.4.2 Analysis for Job event Data using 100 estimators within the ensemble as the parameter . . . . . . . . . . . . . . . . . . . 68 4.4.3 Analysis for Acquisition event Data using 100 estimators within the ensemble as the parameter . . . . . . . . . . . . . 71 4.4.4 Analysis for Vendor-Supplier event Data using 500 estima- tors within the ensemble as the parameter . . . . . . . . . . 74 4.4.5 Analysis for Job event Data using 500 estimators within the ensemble as the parameter . . . . . . . . . . . . . . . . . . . 77 4.4.6 Analysis for Acquisition event Data using 500 estimators within the ensemble as the parameter . . . . . . . . . . . . . 80 4.5 Final Accuracies and F-score estimates for the model . . . . . . . . 83 4.5.1 Final Analysis of Vendor-Supplier Dataset . . . . . . . . . . 84 4.5.2 Final Analysis of Job Dataset . . . . . . . . . . . . . . . . . 85 4.5.3 Final Analysis of Acquisition Dataset . . . . . . . . . . . . . 87 4.6 Results obtained for MFN with Word Embedding . . . . . . . . . . 90 4.7 Results obtained for Convolutional Neural Networks . . . . . . . . . 90 4.7.1 Analysis for Vendor-Supplier Data using CNN-rand and CNN- word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.7.2 Analysis for Acquisition Data using CNN-rand and CNN- word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.7.3 Analysis for Job using CNN-rand and CNN-word2vec Model 92 4.8 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5 Conclusions and Future work 95 5.1 Challenges Encountered in Business Event Recognition . . . . . . . 95 5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
  • 12.
  • 14.
    List of Figures 3.1The Image describes the architecture for Convolutional Neural Net- work with Sentence Modelling for multichannel architecture . . . . 31 4.1 Variations in Accuracies and F1-scores for Vendor-supplier data us- ing Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . 36 4.2 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for VNSP . . . . . . . . . . . . . . . . . 37 4.3 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 37 4.4 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for VNSP . . . . . . . . . . . . . . . . . 38 4.5 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 38 4.6 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for VNSP . . . . . . . . . . . . . . . . . 39 4.7 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 39 4.8 Variations in Accuracies and F1-scores for Job event data using Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . . . 41 4.9 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for JOB . . . . . . . . . . . . . . . . . . 42 4.10 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 42 4.11 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for JOB . . . . . . . . . . . . . . . . . . 43 4.12 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 43 4.13 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for JOB . . . . . . . . . . . . . . . . . . 44 4.14 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 44 4.15 Variations in Accuracies and F1-scores for Acquisition event data using Naive-Bayes, semi-supervised technique . . . . . . . . . . . . 46 4.16 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . 47 4.17 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 47 vii
  • 15.
    List of Figuresviii 4.18 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . 48 4.19 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 48 4.20 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Acquisition . . . . . . . . . . . . . . 49 4.21 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Acquisition . . . . . . . . . . . . . . . . . . . 49 4.22 variations in Accuracies and F1-scores for Vendor-supplier data us- ing Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.23 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier . . . . . . . . . . . . 52 4.24 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier . . . . . . . . . . . . . . . . 52 4.25 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Vendor-Supplier . . . . . . . . . . . 53 4.26 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier . . . . . . . . . . . . . . . . 53 4.27 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier . . . . . . . . . . . . 54 4.28 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier . . . . . . . . . . . . . . . . 54 4.29 Variations in Accuracies and F1-scores for Job event data using Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.30 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Job . . . . . . . . . . . . . . . . . . 57 4.31 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 57 4.32 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Job . . . . . . . . . . . . . . . . . . 58 4.33 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 58 4.34 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 59 4.35 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 59 4.36 Variations in Accuracies and F1-scores for Acquisition event data using Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.37 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . 62 4.38 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 62 4.39 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . 63
  • 16.
    List of Figuresix 4.40 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 63 4.41 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 64 4.42 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 64 4.43 variations in Accuracies and F1-scores for Vendor-supplier data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 67 4.44 Confusion matrix for Vendor-supplier with number of estimators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.45 Roc curve for for Vendor-supplier with number of estimators as 100 68 4.46 Variations in Accuracies and F1-scores for Job data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.47 Confusion matrix for Job with number of estimators as 100 . . . . . 71 4.48 Roc curve for for Job with number of estimators as 100 . . . . . . . 71 4.49 Variations in Accuracies and F1-scores for Acquisition data for 5- fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 73 4.50 Confusion matrix for Acquisition with number of estimators as 100 74 4.51 Roc curve for for Acquisition with number of estimators as 100 . . . 74 4.52 Variations in Accuracies and F1-scores for Vendor-supplier data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 76 4.53 Confusion matrix for Vendor-supplier with number of estimators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.54 Roc curve for for Vendor-supplier with number of estimators as 500 77 4.55 Variations in Accuracies and F1-scores for Job data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.56 Confusion matrix for Job with number of estimators as 500 . . . . . 80 4.57 Roc curve for for Job with number of estimators as 500 . . . . . . . 80 4.58 Variations in Accuracies and F1-scores for Acquisition data for 5- fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 82 4.59 Confusion matrix for Acquisition with number of estimators as 500 83 4.60 Roc curve for for Acquisition with number of estimators as 500 . . . 83 4.61 Variations in Accuracies and F1-scores for Vendor-supplier data for whole data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.62 Variations in Accuracies and F1-scores for Job data for whole data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.63 Variations in Accuracies and F1-scores for Acquisition data 5-folds accuracy variations for whole data set . . . . . . . . . . . . . . . . . 89 4.64 CNN-rand and CNN-word2vec models for Vendor-supplier on whole data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.65 CNN-rand and CNN-word2vec models for Acquisition on whole data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.66 CNN-rand and CNN-word2vec models for Job on whole data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
  • 18.
    List of Tables 1.1Recognition of Named-Event Passages in News Articles and its ap- plication to our work . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1 The words and their counts in the sentence1 . . . . . . . . . . . . . 22 2.2 The words and their counts in the sentence2 . . . . . . . . . . . . . 22 4.1 Variation in accuracies and F-scores in Semi-supervised learning using naive Bayes for vendor-supplier data . . . . . . . . . . . . . . 36 4.2 Variation in accuracies and F-scores in Semi-supervised learning using naive Bayes for Job event data . . . . . . . . . . . . . . . . . 41 4.3 Variation in accuracies and F-scores in Semi-supervised learning using naive Bayes for Acquisition event data . . . . . . . . . . . . . 46 4.4 Variation in accuracies and F-scores using Active Learning for Vendor- supplier event data . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5 Variation in accuracies and F-scores using Active Learning for Job event data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.6 Variation in accuracies and F-scores using Active Learning for Ac- quisition event data . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.7 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 100 in vendor-supplier data set 66 4.8 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 100 in vendor-supplier data . . . . 66 4.9 Variation in accuracies and F-scores for random forest classifier for number of parameter estimate as 100 in vendor-supplier data . . . . 66 4.10 Variation in test score for accuracy and F-score with Vendor-supplier data using voting of three ensemble classifiers with number of esti- mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.11 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 100 in Job data set . . . . . . 69 4.12 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 100 in Job data set . . . . . . . . 69 4.13 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 100 in Job data set . . . . . . . . 69 4.14 variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 100 70 4.15 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 100 in Acquisition data set . . 72 xi
  • 19.
    List of Tablesxii 4.16 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 100 in Acquisition data set . . . . 72 4.17 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 100 in Acquisition data set . . . . 72 4.18 Variation in test score for accuracy and F-score with Acquisition data using voting of three ensemble classifiers with number of esti- mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.19 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 500 in vendor-supplier data set 75 4.20 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 500 in vendor-supplier data set . . 75 4.21 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 500 in vendor-supplier data set . . 75 4.22 Variation in test score for accuracy and F-score with vendor-supplier data using voting of three ensemble classifiers with number of esti- mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.23 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 500 in Job data set . . . . . . 78 4.24 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 500 in Job data set . . . . . . . . 78 4.25 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 500 in Job data set . . . . . . . . 78 4.26 variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 500 79 4.27 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 500 in Acquisition data set . . 81 4.28 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 500 in Acquisition data set . . . . 81 4.29 Variation in accuracies and F-scores for Ada boosting classifier for number of parameter estimate as 500 in Acquisition data set . . . . 81 4.30 Variation in test score for accuracy and F-score with Acquisition data using voting of three ensemble classifiers with number of esti- mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.31 Variation in accuracies and F-scores for Gradient Boosting classifier for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . 84 4.32 Variation in accuracies and F-scores for Ada Boosting classifier for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 84 4.33 Variation in accuracies and F-scores for Random forest classifier for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 85 4.34 Variation in accuracies and F-scores for Gradient Boosting classifier for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.35 Variation in accuracies and F-scores for Ada Boosting classifier for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.36 Variation in accuracies and F-scores for Random forest classifier for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
  • 20.
    List of Tablesxiii 4.37 Variation in accuracies and F-scores for Gradient Boosting classifier for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . 88 4.38 Variation in accuracies and F-scores for Random forest classifier for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 88 4.39 Variation in accuracies and F-scores for Ada boosting classifier for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 89 4.40 Variation in test score for MFN with word embedding . . . . . . . . 90 4.41 Variation in accuracies and F-scores CNN-rand and CNN-word2vec models for Vendor-supplier on whole data set . . . . . . . . . . . . . 91 4.42 Variation in accuracies and F-scores CNN-rand and CNN-word2vec models for Acquisition on whole data set . . . . . . . . . . . . . . . 92 4.43 Variation in accuracies and F-scores CNN-rand and CNN-word2vec models for Job on whole data set . . . . . . . . . . . . . . . . . . . 93
  • 22.
    List of Abbreviations POSParts of speech NLTK Natural Language Tool Kit QBC Query By Committee NLP Natural language processing IE Information Extraction IR Information Retrieval NER Named entity recognizer ML Machine Learning CNN Convolutional Neural network MFN Multilayer feed forward network TF Term Frequency IDF Inverse Document Frequency CBOW Continuous bag of words ROC Receiver operator characteristic TPR True Positive Rate FPR False Positive Rate TP True Positives FP False Positives TN True Negatives FN False Negatives xv
  • 24.
Chapter 1
Introduction

Textual information present on the web is unstructured, and extracting useful information from it for a specific purpose is tedious and challenging. Over the years various methods have therefore been proposed for the extraction of useful text. Text mining is the domain that deals with the process of deriving high-quality information from unstructured text. The goal of text mining is essentially to convert unstructured text into structured data and thereby extract useful information by applying techniques of natural language processing (NLP) and pattern recognition. The concept of manual text mining was first introduced in the mid-1980s (Hobbs et al., 1982). Over the past decade technological advancements in this field have been significant, with the building of automated approaches for the extraction and analysis of text. Text mining draws on five major components: information retrieval, data mining, machine learning, statistics and computational linguistics.

Text mining is applied in various domains, which include: (a) named entity recognition, which deals with the identification of named text features such as people, organizations and locations (Sang et al., 2003); (b) recognition of pattern-identified entities, which deals with the extraction of features such as telephone numbers, e-mail addresses and built-in database quantities that can be discerned using regular expressions or other pattern matches (Nadeau et al., 2007); (c) co-reference resolution, which deals with the identification of noun phrases and other terms that refer to these nouns, e.g. her, him, it and their (Soon et al., 2001); (d) sentiment analysis, which includes extracting various forms of user-intent information such
as sentiment, opinion, mood and emotion; text analytics techniques are helpful in analysing sentiment at the level of different topics (Pang et al., 2008); (e) spam detection, which deals with the classification of e-mail as spam or not, based on the application of statistical machine learning and text mining techniques (Rowe et al., 2007); (f) news analytics, which deals with the extraction of vital news or information content of interest to the end user; and (g) business event recognition from online news articles.

Business Event Recognition From Online News Articles captures semantic signals and identifies patterns in unstructured text to extract business events in three main domains, i.e. acquisition, vendor-supplier and job events, from online news articles. Acquisition business event news in general follows the pattern of one organization acquiring another; the keywords used in the acquisition scenario are acquire, buy, sell, sold, bought, take-over, purchase and merger. Vendor-supplier business event news in general follows the pattern of an organization obtaining a contract from another organization to perform a certain task for it; the keywords used in the vendor-supplier scenario are contract, procure, sign, implement, select, award, work, agreement, deploy, provide, team, collaborate, deliver and joint. Job business event news in general concerns the appointment of persons to prominent positions and the hiring and firing of people within an organization.

This thesis deals with the development of an automated model for business event recognition from online news articles. For developing the model, data was crawled from websites such as Reuters News, businesswireindia.com and prnewswire.com. Since manual labeling of the data was expensive, the gathered data was subjected to semi-supervised learning techniques and active learning methods to obtain more tagged event data in the domains of acquisition, vendor-supplier and job. The tagged data was then pre-processed using natural language processing techniques. Further on, for the conversion of text to numerics, the bag-of-words, word-embedding and word2vec approaches were
used. Final analysis on the business event dataset was performed using ensemble classifiers with the bag-of-words approach and a convolutional neural network with the word-embedding and word2vec approaches.

1.1 Model Architecture

Given a set of online articles or documents of interest to the end user, the developed automated model must predict whether a given sentence contains a business event related to acquisition, vendor-supplier or job events. If the automated model predicts a sentence as a business event, it must also give additional information describing the event, such as the entities involved in that particular event, like organizations and people. Providing such additional information helps the end user make better decisions with quicker insights.

Business events happen around the world on a daily basis. An organization, as a competitor, would like to understand the business analytics of other organizations. The development of an automated approach for identifying such business events helps in better decision making, increases efficiency and helps an organization develop better business strategies.

1.2 Methods

The sections below describe the methods used in our work.

1.2.1 Natural Language Processing

The concepts of information extraction and information retrieval in our work deal with the extraction and retrieval of business news containing business event sentences from online news articles. The concepts of part-of-speech (POS) tagging and named entity recognition (NER) are used as part of feature engineering in our work. The pattern of POS tags is essential in extracting useful
semantic features, and NER is useful in extracting entity-type features like organizations, persons and locations, which form an integral part of any business event. The framework for our project is formed by the concepts of information extraction (IE) and information retrieval (IR). Discussed below are the methods of information extraction and retrieval, named entity recognition (NER) and parts-of-speech (POS) tagging, which form the baseline for the implementation of natural language processing techniques (Liddy, 2001).

Information Extraction and Retrieval: Information extraction and retrieval deal with searching the required text, extracting semantic information from the text and storing the retrieved information in a particular form in a database.

Named Entity Recognition: Named entity recognition deals with extracting from a text document a set of people, places and type-based entities, which include organizations.

Parts-of-Speech Tagging: The pattern of POS tags forms an important set of features for any NLP-related task. Extraction of proper semantic features is possible with the pattern of POS tags.

1.2.2 Text to Numeric Conversion

The conversion of words to vectors was implemented using the bag-of-words and word-embedding approaches. Described below is an overview of these concepts.

In the bag-of-words approach a piece of text or a sentence of a document is represented as the bag (multiset) of its words, disregarding grammar and word order but keeping the multiplicity of the words intact (Harris, 1954). Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words from sentences are mapped to vectors of real numbers in a low-dimensional space, relative to the vocabulary size (Tomas Mikolov et al., 2013).

One of the major disadvantages of the bag-of-words approach is that it fails to capture the semantics of a particular word within a sentence, because it converts words to
vectors disregarding the grammar and the word order. Consider the following sentence where the bag-of-words approach fails: "After drawing money from the Bank Ravi went to the river Bank." In the bag-of-words approach there is no distinction between the financial Bank and the river Bank. This problem of capturing the semantics of a word is, to a certain extent, overcome by word embedding. In word embedding each word is represented by a 100- to 300-dimensional dense vector initialized uniformly at random (i.e. U[-1,1]). Word embedding with a window approach captures semantics to a certain extent.

1.2.3 Data Labeling

The extracted data points labeled in a supervised manner were few in number. The sections below describe the semi-supervised technique and active learning methods used to obtain more labeled data.

1.2.3.1 Semi-supervised Technique

The naive Bayes classifier forms an integral part of the implementation of semi-supervised learning using naive Bayes with expectation maximization to increase the number of labeled data points (Kamal Nigam et al., 2006). Discussed below is an overview of the naive Bayes classifier.

Naive Bayes classifiers are probabilistic classifiers which use the concept of Bayes' theorem. In a naive Bayes classifier the assumption is made that each feature is conditionally independent of every other feature. The modeling of a naive Bayes classifier is described as follows. Given an input feature vector x = (x_1, x_2, ..., x_n)^T we need to calculate which class this feature vector belongs to, i.e. p(Y_k | x_1, x_2, ..., x_n) for each of the k classes, where Y_k is the output variable for the k-th class. Using Bayes' theorem we can rewrite the above probability expression as

p(Y_k | x) = p(Y_k) p(x | Y_k) / p(x)

where
p(Y_k) is the prior probability of that particular class,
p(x | Y_k) is the class-conditional likelihood (estimated by maximum likelihood),
p(x) is the probability of observing that particular data point.

The naive Bayes classifier framework uses the maximum a posteriori (MAP) rule to pick the most probable output class:

maximum a posteriori probability ∝ prior × likelihood.

The naive Bayes classifier assigns a label ŷ = Y_k based on the MAP rule, and the classifier prediction is given as

ŷ = argmax_{k ∈ {1,...,K}} p(Y_k) ∏_{i=1}^{n} p(x_i | Y_k).

In text mining the classifier used is the multinomial naive Bayes classifier with the bag-of-words approach.
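As a concrete illustration of the multinomial naive Bayes classifier combined with the bag-of-words approach, the following minimal Python sketch (scikit-learn, with a made-up toy corpus rather than the thesis data) shows how a sentence-level event classifier of this kind can be fit and queried.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus; labels mark acquisition events (1) vs. non-events (0).
sentences = [
    "IBM announced a definitive agreement to acquire Silverpop",
    "The company reported quarterly earnings in line with estimates",
    "Google agreed to buy Nest Labs for 3.2 billion dollars",
    "The conference will be held in Mumbai next month",
]
labels = [1, 0, 1, 0]

# Bag-of-words: each sentence becomes a vector of word counts.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

# Multinomial naive Bayes applies the MAP rule described above.
clf = MultinomialNB()
clf.fit(X, labels)

test = vectorizer.transform(["Oracle signed a deal to purchase a software startup"])
print(clf.predict(test), clf.predict_proba(test))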
1.2.3.2 Active Learning

Active learning with a query-by-committee approach using ensemble classifiers was implemented as part of our work to increase the number of labeled data points (Abe and Mamitsuka, 1998). Discussed below is the concept of active learning.

Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. There are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the user for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning. The following are query strategies for selecting the most informative data points in active learning.

Uncertainty sampling: label those points about which the current model is least certain, or for which the entropy of the predicted label distribution is maximum, by querying the user.

Query by committee: a combination of classifiers is trained on the current labeled data points; a vote is then taken on the predicted labels, and the user is queried for the labels on which the classifiers disagree the most.

Expected model change: label the data points which would result in the most drastic change to the current model.

Expected error reduction: label those points which would reduce the current model's generalization error the most.

Variance reduction: label those points which most reduce the output variance of the current model; these are the points nearest to the separating hyperplane in an SVM.

1.2.4 Learning Classifiers

The classifiers used in our work were ensemble classifiers and convolutional neural networks (CNN). The sections below give a basic overview of the concepts required to understand the ensemble methods and the CNN implemented in our work.

1.2.4.1 Ensemble Classifiers

The random forest classifier implemented in our work (Breiman, 2001) is derived from the bootstrap aggregation technique. The gradient boosting classifier (Friedman et al., 2001) and the AdaBoost classifier (Freund et al., 1995) implemented in our work are derived from the boosting technique. Discussed below are the concepts of ensembles with bagging and boosting.

Ensembles combine classifiers in such a way that the performance of the combined classifier is better than the performance of each individual classifier. There are two different kinds of ensemble methods in practice: one is bagging, also called bootstrap aggregation, and the other is boosting.
Bagging: In bagging, a single classifier is learnt from a subset of the training data at each instance. From a training set of size M, it is possible to draw M random instances using a uniform distribution. The M samples drawn at each instant are learned by a classifier, and this process is repeated several times. Since the sampling is done with replacement, certain data points may get picked twice and certain data points not at all within the subset of the original training dataset. A classifier is learnt on such a subset of the training data for each cycle. The final prediction is based on a vote over the classifiers built from the different generated datasets.

Boosting: In boosting, a single classifier (or different classifiers) is learnt from a subset of the data at each instance. The boosting technique analyses the performance of the learnt classifier at each instant and forces the classifier to focus on the training instances which were incorrectly classified. Instead of choosing the M training instances randomly using a uniform distribution, one chooses the training instances so as to favour the instances that have not been accurately learned by the classifier. The final prediction is performed by taking a weighted vote of the classifiers learnt over the various instances.

1.2.5 Convolutional Neural Network

A convolutional neural network for sentence modelling, trained with a softmax classifier, was implemented in our work (Yoon Kim, 2014). Discussed below is an overview of a generalized convolutional neural network and the softmax classifier.

A convolutional neural network is a type of feed-forward neural network whose architecture consists of the following main layers: convolutional layer, pooling layer, fully connected layer and loss layer. The stacking of these layers forms the full conv-net architecture.

Convolutional Layer: In conventional image processing, convolution with Sobel or Prewitt filters is useful in detecting features of the image such as edges,
corners, etc. In a convolutional neural net, by comparison, the parameters of each convolutional kernel (i.e. each filter) are trained by the back-propagation algorithm. There are many convolution kernels in each layer, and each kernel is replicated over the entire image with the same parameters. The function of the convolution operators is to extract different features of the input.

Activation Function: The activation functions used in convolutional neural networks are the hyperbolic-tangent function f(x) = tanh(x), the ReLU function f(x) = max(0, x) and the sigmoid function f(x) = 1 / (1 + exp(-x)).

Pooling layer: This layer captures the most important feature by performing the max operation on the obtained feature-map vector. All such max features together form the penultimate layer.

Fully connected layer: Finally, after several convolutional and max-pooling layers, the high-level reasoning in the neural network is done via fully connected layers. A fully connected layer takes all neurons in the previous layer (be it fully connected, pooling, or convolutional) and connects each of them to every one of its own neurons. Fully connected layers are no longer spatially located (they can be visualized as one-dimensional), so there can be no convolutional layers after a fully connected layer.

Loss layer: After the fully connected layer, a softmax classifier is present at the output layer with a softmax loss function, to predict probabilistic labels. The softmax classifier is obtained from the softmax function; for a sample input vector x, the predicted probability of output y belonging to the j-th class among K classes is given as

P(y = j | x) = exp(x^T w_j) / Σ_{k=1}^{K} exp(x^T w_k)

1.2.6 Measures used for Analysing the Results

The performance measures used for our results and analysis are described as follows (Powers et al., 2007).
1. F-score: The F-score is a measure used in information retrieval for measuring sentence classification performance, since it takes only the true positives into account and not the true negatives. The F-score is defined as

F1 = 2·TP / (2·TP + FP + FN)

2. Confusion matrix: The performance of any classification algorithm can be visualized by a specific table layout called the confusion matrix. Each column of the confusion matrix represents the instances of a predicted class, while each row represents the instances of an actual class.

3. ROC curve: This is a plot of TPR against FPR. The TPR describes the fraction of true positive results among the total positive samples; the FPR describes the fraction of incorrect positive results among the total negative samples. The area under the ROC curve is a measure of accuracy.

4. Accuracy: The accuracy of a classification problem is defined as

accuracy = (TP + TN) / (P + N)
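The following short Python sketch (scikit-learn) shows how these measures are typically computed; the label vectors here are made-up illustrations, not results from the thesis experiments.

from sklearn.metrics import (accuracy_score, f1_score,
                             confusion_matrix, roc_auc_score)

# Hypothetical ground-truth labels and classifier outputs (1 = business event).
y_true   = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred   = [1, 0, 0, 1, 0, 1, 1, 0]
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print("Accuracy :", accuracy_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("Confusion matrix:")
print(confusion_matrix(y_true, y_pred))
print("ROC AUC  :", roc_auc_score(y_true, y_scores))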
1.3 Related Works

The paper closest to our work on Business Event Recognition From Online News Articles is Recognition of Named-Event Passages in News Articles (Luis Marujo et al., 2012). It describes a method for finding named events in the violent-behaviour and business domains, in the specific passages of news articles that contain information about such events, and reports preliminary evaluation results obtained using NLP techniques and ML algorithms. Table (1.1) summarizes that paper and its application to our work.

As part of the feature engineering used in our work, we have used some of the feature engineering techniques of (Luis Marujo et al., 2012). The following features used in our work are taken with reference to this paper.

Part-of-speech (POS) pattern of the phrase: (e.g., <noun>, <adj, noun>, <adj, adj, noun>, etc.). Noun and noun phrases are the most common pattern observed in key phrases containing named events, verb and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare.

Extraction of rhetorical signal features: These are features which capture the reader's attention in news events: continuation, change of direction, sequence, illustration, emphasis, cause, condition, result, spatial signals, comparison/contrast, conclusion and fuzz.

1.4 Thesis Outline

The second chapter deals with the extraction and understanding of the business event data, the third chapter with the application of machine-learning algorithms on the obtained data, the fourth chapter with the results and analysis on the business event datasets, and the fifth chapter with the conclusions of our work.

1.4.1 Second Chapter

This chapter deals with the extraction of business event data from the web, followed by pre-processing of the data, application of feature engineering on the obtained data and finally conversion of the data into vectors for applying machine-learning algorithms.

1.4.2 Third Chapter

This chapter deals with applying semi-supervised techniques on the data to increase the number of data points, and with an understanding of the algorithms of the different ensemble classifiers and the CNN (convolutional neural network).
Table 1.1: Recognition of Named-Event Passages in News Articles and its application to our work

1. Recognition of named-event passages from news articles: deals with automatically identifying multi-sentence passages in a news article that describe named events. Specifically, the paper focuses on ten event types, five in the violent-behaviour domain (terrorism, suicide bombing, sex abuse, armed clashes and street protests) and five in the business domain (management changes, mergers and acquisitions, strikes, legal troubles and bankruptcy).
   Business event recognition from online news articles (our work): derived from Recognition of Named-Event Passages in News Articles, but focused exclusively on identifying business events in the domains of merger and acquisition, vendor-supplier and job events.

2. Recognition of named-event passages from news articles: the problem is solved as a multiclass classification problem, for which the training data was obtained through crowd-sourcing on Amazon Mechanical Turk to label the data points as events or non-events. Ensemble classifiers are then used for the classification of these sentences for each event, and finally passages containing the same events are aggregated using HMM methods.
   Business event recognition from online news articles (our work): the problem is solved as a binary classification for each of the three domains (merger and acquisition, vendor-supplier and job), deciding whether a sentence describes that particular event or not. The procedure also differs in that we label a few data points in a supervised way and then increase the number of labeled data points by applying semi-supervised techniques, finally applying ensemble classifiers and convolutional neural networks for classification of the labeled data points.
1.4.3 Fourth Chapter

This chapter deals with the results and analysis of the applied machine-learning techniques, which include the semi-supervised learning analysis, the ensemble classifier analysis and the analysis of the convolutional neural networks.

1.4.4 Fifth Chapter

This chapter deals with the challenges encountered while performing the project, the conclusions of the project and its future scope.

1.5 Thesis Contribution

Our work focuses on business event recognition in three domains: acquisition, vendor-supplier and job. The main contribution of our work is the whole process of identifying business event news exclusively in these three domains using machine learning and NLP techniques.
Chapter 2
Data Extraction, Data Pre-processing and Feature Engineering

The initial step in business event recognition is extracting business news and labeling a few of the extracted data points, so that the task can be formulated as a machine learning problem. The method of data extraction from the web and the labeling of some of the extracted data are described in the following sections.

2.1 Crawling of Data from the Web

There are several methods to crawl data from the web; one such method is described in this section. Every website has its own HTML structure, so separate crawling logic had to be written to extract text data from different websites. The Python modules used for data extraction are Beautiful Soup and urllib. For our study, information was extracted from particular websites such as businesswireindia.com, prnewswire.com and Reuters News. The Python language framework was used throughout our work. The urllib module is used to fetch the particular set of pages which have to be accessed on the web. The Beautiful Soup module uses the HTML structure to find the contents present within each page in the form of the title, subtitle and description, block by block. Finally the extracted title, subtitle and body contents are stored as text files.
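The exact crawling logic is site-specific, but the following minimal Python sketch (urllib and Beautiful Soup, with a hypothetical URL and generic tag selectors rather than the real site-specific logic) illustrates the kind of extraction described above.

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Hypothetical press-release page; every site needs its own selectors.
url = "https://www.example.com/news/sample-press-release.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "html.parser")

# The tags used below are placeholders for the site-specific HTML logic.
title = soup.find("h1").get_text(strip=True) if soup.find("h1") else ""
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
body = "\n".join(paragraphs)

# Store the extracted article as a plain text file, as done in our pipeline.
with open("article.txt", "w", encoding="utf-8") as f:
    f.write(title + "\n\n" + body)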
2.2 Labeling of Extracted Data

Since business events are in the form of sentences, the raw text document obtained from web crawling is split into sentences using the Natural Language Toolkit (NLTK) sentence tokenizer. Some of the sentences were then labeled in each of the three classes, merger and acquisition, vendor-supplier and job, according to whether or not they describe a business event.

2.2.1 Data Description

Stated below are illustrations of data describing a business event and not a business event for the three classes acquisition, vendor-supplier and job.

2.2.1.1 Acquisition Data Description

Acquisition event: ARMONK, N.Y., April 10, 2014 /PRNewswire/ – IBM (NYSE: IBM) today announced a definitive agreement to acquire Silverpop, a privately held software company based in Atlanta, GA.

Non-acquisition event: Carlyle invests across four segments Corporate Private Equity Real Assets Global Market Strategies and Solutions in Africa Asia Australia Europe the Middle East North America and South America.

2.2.1.2 Vendor-Supplier Data Description

Vendor-supplier event: Tri-State signs agreement with NextEra Energy Resources for new wind facility in eastern Colorado under the Director Jack Stone; WESTMINSTER, Colo., Feb. 5, 2014 /PRNewswire/ – Tri-State Generation and Transmission Association, Inc. announced that it has entered into a 25-year agreement with a subsidiary of NextEra Energy Resources, LLC for a 150 megawatt
wind power generating facility to be constructed in eastern Colorado, in the service territory of Tri-State member cooperative K. C. Electric Association (Hugo, Colo.).

Non-vendor-supplier event: The implementation of the DebMed GMS electronic hand hygiene monitoring system is a clear demonstration of Meadows Regional Medical Center's commitment to patient safety, and we are excited to partner with such a forward-thinking organization that is focused on providing a state-of-the-art patient environment, said Heather McLarney, vice president of marketing, DebMed.

2.2.1.3 Job Data Description

Job event: In a note to investors, analysts at FBR Capital Markets said the appointment of Nadella as Director of the company was a "safe pick" compared to choosing an outsider.

Non-job event: This partnership is an example of steps we are taking to simplify and improve the Tactile Medical order process, said Cathy Gendreau, Business Director.

2.2.2 Data Pre-processing

The business event sentences extracted as raw text were cleansed by removing special characters and stop-words, which include words like the, and, an, etc. The stop-words are common to the positive and negative classes, so to enhance the difference between the two classes we had to remove them. The NLTK module in Python was used for this pre-processing of the data.
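A minimal sketch of the sentence splitting and cleansing steps is given below, using NLTK as described above; the input string is a made-up example, not a crawled article.

import re
import nltk
from nltk.corpus import stopwords

# One-time downloads of the sentence tokenizer model and the stop-word list.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

raw_text = ("IBM today announced a definitive agreement to acquire Silverpop. "
            "The deal is expected to close in the second quarter!")

# Split the crawled document into candidate sentences.
sentences = nltk.sent_tokenize(raw_text)

stop_words = set(stopwords.words("english"))
cleaned = []
for sent in sentences:
    sent = re.sub(r"[^A-Za-z0-9\s]", " ", sent)          # drop special characters
    tokens = [w for w in sent.split() if w.lower() not in stop_words]
    cleaned.append(" ".join(tokens))

print(cleaned)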
2.3 Feature Engineering

To build hand-crafted features, we had to observe the extracted unstructured data and recognize patterns, so that useful features could be extracted. The extracted features are described below; the examples given for the corresponding features are taken with reference to the vendor-supplier event in (2.2.1.2).

2.3.1 Type 1 Features

Shallow semantic features record the pattern and semantics of the data, and consist of the following features (Luis Marujo et al., 2012).

Nouns, noun phrases and proper nouns: Entities form an integral part of business event sentences, so noun phrases and proper nouns are common in sentences containing business events. Using the NLTK parts-of-speech tagger, noun phrases, and correspondingly nouns and proper nouns, were extracted from each sentence.

Example of noun phrases: Title agreement Next Era Energy wind facility eastern Colorado WESTMINSTER Colo. Feb. Generation Transmission Association Inc. agreement subsidiary NextEra Energy LLC megawatt wind power facility eastern Colorado service territory member K. C. Electric Association Hugo Colo.

Word-Capital: If a capital letter is present in a sentence containing a business event, there is a higher chance of organizations, locations and persons being present in the sentence; these in turn are entity-type features which enhance event recognition.

Example of capital words: WESTMINSTER, LLC, K.C. Here WESTMINSTER is a location and K.C. is part of an organization name, illustrating entity features obtained from the Word-Capital feature.

Parts-of-speech tag pattern: Patterns of POS tags such as adjective-noun (a noun preceded by an adjective) and adjective-adjective-noun (a noun preceded by two adjectives) are good features for event recognition. Adjectives are used to describe a noun, so there is a higher chance of finding this kind of pattern in a business event sentence. Noun and noun phrases are the most common
pattern observed in key phrases of business event sentences, verb and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare.

Example of the Adj-Noun POS tag pattern: new wind, 25-year agreement, Tri-State member; here, for instance, 25-year is the adjective and agreement is the noun.

2.3.2 Type 2 Features

Entity-type features capture the entities present in the business event sentence. Some of these features are described below.

Organization name: Organization names are usually present in sentences containing business events and often give additional insight as features in event recognition.

Example of organization names: Tri-State Generation and Transmission Association, NextEra Energy Resources.

Organization references: References to organization entities present in the business event sentences are taken as features.

Example of organization references: K. C. Electric Association.

Location: Location is an important entity-describing feature, giving more insight into the description of business events.

Example of location as a feature: WESTMINSTER, Colo., Colorado.

Persons: There is a higher chance of a person or a group of people being present in sentences that contain business events, so persons are used as features to enhance business event recognition.

Example of persons: Jack Stone.
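A minimal sketch of how such POS-pattern and entity-type features can be pulled out with NLTK is given below; the sentence is adapted from the example above, and the feature set used in the actual pipeline is richer than this.

import nltk

for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = ("Tri-State Generation and Transmission Association announced "
            "a 25-year agreement with NextEra Energy Resources in Colorado.")

tokens = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(tokens)                      # POS tag pattern (Type 1 features)

# Adjective-noun pairs, one of the POS patterns described above.
adj_noun = [(w1, w2) for (w1, t1), (w2, t2) in zip(tags, tags[1:])
            if t1.startswith("JJ") and t2.startswith("NN")]

# Named-entity chunks give organization / location / person features (Type 2).
tree = nltk.ne_chunk(tags)
entities = [(subtree.label(), " ".join(w for w, _ in subtree.leaves()))
            for subtree in tree if hasattr(subtree, "label")]

print(adj_noun)
print(entities)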
2.3.3 Type 3 Features

Rhetorical features are semantic signals which capture the reader's attention in a business event sentence; the following features are identified in the literature as described in (Luis Marujo et al., 2012).

Continuation: there are more ideas to come, e.g.: moreover, furthermore, in addition, another.

Change of direction: there is a change of topic, e.g.: in spite of, nevertheless, the opposite, on the contrary.

Sequence: there is an order in the presentation of ideas, e.g.: in first place, next, into.

Illustration: gives an example, e.g.: to illustrate, in the same way as, for instance, for example.

Emphasis: increases the relevance of an idea; these are the most important signals, e.g.: it all boils down to, the most substantial issue, should be noted, the crux of the matter, more than anything else.

Cause, condition or result: there is a condition or modification attached to the following idea, e.g.: if, because, resulting from.

Spatial signals: denote locations, e.g.: in front of, between, adjacent, west, east, north, south, beyond.

Comparison or contrast: comparison of two ideas, e.g.: analogous to, better, less than, less, like, either.

Conclusion: ends the introduction of the idea and may have special importance, e.g.: in summary, from this we see, last of all, hence, finally.
Fuzz: there is an idea that is not clear, e.g.: looks like, seems like, alleged, maybe, probably, sort of.

2.4 Description of Vectorizers

All the features extracted from a given sentence have to be converted into vectors using vectorizers such as the count vectorizer and the TF-IDF vectorizer. The method used to convert words to vectors is the bag-of-words approach; the two vectorizers based on it are described below.

2.4.1 Count Vectorizer

This module uses the counts of the words present within a sentence and converts the sentence into a vector by building a dictionary for the word-to-vector conversion (Harris, 1954). An illustrative example of the count vectorizer is described below.

2.4.1.1 Example of Count Vectorizer

Consider the following two sentences.
a) John likes to watch movies. Mary likes movies too.
b) John also likes to watch football games.
Based on the above two sentences the dictionary is constructed as follows:
{ John:1, likes:2, to:3, watch:4, movies:5, also:6, football:7, games:8, Mary:9, too:10 }
The constructed dictionary has 10 distinct words. Using the indexes of the dictionary, each sentence is represented by a 10-entry vector:
sentence1 : [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
sentence2 : [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
where each entry of the vector is the count of the corresponding entry in the dictionary (this is also the histogram representation). For example, in the first vector (which represents sentence 1), the first two entries are [1, 2]. The first entry corresponds to the word John, which is the first word in the dictionary, and its value is 1 because John appears in the first sentence once. Similarly the second
entry corresponds to the word likes, which is the second word in the dictionary, and its value is 2 because likes appears in the first sentence twice. This vector representation does not preserve the order of the words in the original sentences.

2.4.2 Term Frequency and Inverse Document Frequency

Term frequency and inverse document frequency describe the importance of a particular word in a document or sentence relative to a collection of documents (Manning et al., 2008). Term frequency (TF) is defined as the number of occurrences of a particular word within a document, and inverse document frequency (IDF) is based on the number of documents containing that word. For the analysis in our work using tf-idf with the bag-of-words approach, we treat each sentence as a document. Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a particular word is to a sentence within a collection of sentences.

2.4.2.1 Formulation of Term Frequency and Inverse Document Frequency

Term frequency formulation: The term frequency tf(t, d) describes the number of times the term t occurs in the sentence d. Two formulations of term frequency are described below:
a) Boolean frequency: tf(t, d) = 1 if t occurs in d, and 0 otherwise.
b) Logarithmically scaled frequency: tf(t, d) = 1 + log tf(t, d) if t occurs in d, and 0 otherwise.

Inverse document frequency formulation: Inverse document frequency is a measure of how much information a particular word provides about a sentence, in comparison with the collection of sentences under consideration. Inverse document frequency measures whether the term is common or rare across the whole collection of
sentences. Mathematically it is described as follows:

idf(t, D) = log ( N / |{d ∈ D : t ∈ d}| )

where N is the total number of sentences in the collection and |{d ∈ D : t ∈ d}| is the number of sentences d in which the term t appears (i.e., tf(t, d) ≠ 0). If the term does not occur in any sentence of the collection, this leads to a division by zero; it is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.

2.4.2.2 Description of the Combination of TF and IDF

Tf-idf is then calculated as

tf-idf(t, d, D) = tf(t, d) × idf(t, D)

A high tf-idf weight is reached by a high term frequency (in the given sentence) and a low document frequency of the term in the whole collection of sentences; the weights hence tend to filter out common terms.

2.4.2.3 Example of TF-IDF Vectorizer

Consider the term frequency tables (2.1) and (2.2) for a collection consisting of only two sentences, as listed below.

Table 2.1: The words and their counts in sentence1
Term      Term Count
this      1
is        1
a         2
sample    1

Table 2.2: The words and their counts in sentence2
Term      Term Count
this      1
is        1
another   2
example   3
The calculation of tf-idf for the term this in sentence1 is performed as follows. Term frequency, in its basic form, is just the frequency looked up in the appropriate table; in this case it is one for the term this in sentence1. The IDF for the term this is given as follows:

idf(this, D) = log ( N / |{d ∈ D : t ∈ d}| )

The numerator of the fraction, N, is the number of sentences, which is two. The number of sentences in which this appears is also two, giving

idf(this, D) = log(2/2) = 0

So the tf-idf value is zero for this term, and with the basic definition this is true of any term that occurs in all sentences. Now consider the term example from sentence2, which occurs three times but in only one sentence. For this term the tf-idf is

tf(example, sentence2) = 3
idf(example, D) = log(2/1) ≈ 0.3010
tf-idf(example, sentence2) = tf(example, sentence2) × idf(example, D) = 3 × 0.3010 ≈ 0.9030
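The hand calculation above can be reproduced with a few lines of Python. The snippet below uses base-10 logarithms (matching the ≈ 0.3010 value) and two toy sentences reconstructed from the count tables; it is a sketch of the formulas, not the vectorizer module used in the pipeline.

import math

sentences = [
    "this is a a sample".split(),
    "this is another another example example example".split(),
]

def tf(term, sentence):
    return sentence.count(term)

def idf(term, corpus):
    df = sum(1 for s in corpus if term in s)     # sentences containing the term
    return math.log10(len(corpus) / df)

def tf_idf(term, sentence, corpus):
    return tf(term, sentence) * idf(term, corpus)

print(tf_idf("this", sentences[0], sentences))     # 0.0
print(tf_idf("example", sentences[1], sentences))  # about 0.9031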
Chapter 3
Machine Learning Algorithms Used for the Analysis of Business Event Recognition

This chapter discusses the set of machine learning algorithms which were implemented as part of our work. The semi-supervised approach with naive Bayes expectation-maximization and active learning with QBC are used in our work to increase the amount of labeled data. The gradient boosting classifier, AdaBoost classifier, random forest classifier, multilayer feed-forward network and convolutional neural network are used to classify the business event data. The following sections give a detailed understanding of these algorithms.

3.1 Semi-supervised Learning using a Naive Bayes Classifier with the Expectation-Maximization Algorithm

In this approach a naive Bayes classifier is first built in the standard supervised fashion from the limited amount of labeled training data, and the unlabeled data is classified with this naive Bayes model, noting the probabilities associated with each class. We then rebuild a new naive Bayes classifier using all
the labeled data and the unlabeled data, using the estimated class probabilities as true class labels. We iterate this process of classifying the unlabeled data and rebuilding the naive Bayes model until it converges to a stable classifier, at which point the corresponding set of labels for the unlabeled data is obtained. The algorithm is summarized below as in (Kamal Nigam et al., 2006).

1. Inputs: collections X_l of labeled sentences and X_u of unlabeled sentences.

2. Build an initial naive Bayes classifier K* from the labeled sentences X_l only.

3. Loop while the classifier parameters improve, as measured by the change in l(K | X, Y) (the log probability of the labeled and unlabeled data and the prior):

(a) (E-step) Use the current classifier K* to estimate the component membership of each unlabeled sentence, i.e. the probability that each mixture component (and class) generated each sentence, P(Y = c_j | X = x_i; K*), where X and Y are random variables, c_j is the output of the j-th class and x_i is the i-th input data point.

(b) (M-step) Re-estimate the classifier K* given the estimated component membership of each sentence, using maximum a posteriori parameter estimation to find K* = argmax_K P(X, Y | K) P(K).

4. Output: the classifier K*, which takes an unlabeled sentence and predicts a class label.
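The following Python sketch shows a simplified "hard" variant of this procedure using scikit-learn's MultinomialNB: each round the unlabeled pool is re-labeled with the current model's most probable class instead of carrying the full soft posteriors, and the inputs are placeholders rather than the thesis data.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def nb_em_self_training(labeled_texts, labels, unlabeled_texts, n_iter=10):
    vec = CountVectorizer()
    X_l = vec.fit_transform(labeled_texts)
    X_u = vec.transform(unlabeled_texts)

    clf = MultinomialNB().fit(X_l, labels)          # initial supervised model
    pseudo = clf.predict(X_u)

    for _ in range(n_iter):
        # M-step (hard variant): refit on labeled plus currently pseudo-labeled data.
        X_all = np.vstack([X_l.toarray(), X_u.toarray()])
        y_all = np.concatenate([np.asarray(labels), pseudo])
        clf = MultinomialNB().fit(X_all, y_all)

        # E-step: re-estimate the labels of the unlabeled sentences.
        new_pseudo = clf.predict(X_u)
        if np.array_equal(new_pseudo, pseudo):      # converged to stable labels
            break
        pseudo = new_pseudo
    return clf, pseudo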
3.2 Active Learning using Ensemble Classifiers with the QBC Approach

The ensemble classifiers used for the QBC approach are the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. This approach is described briefly below.

3.2.1 Query by Committee

In this approach an ensemble of hypotheses is learned, and the examples that cause maximum disagreement amongst this committee (with respect to the predicted categorization) are selected as the most informative examples from a pool of unlabeled examples. QBC iteratively selects examples to be labeled for training; in each iteration a committee of classifiers based on the current training set predicts labels. It then evaluates the potential utility of each example in the unlabeled set and selects a subset of examples with the highest expected utility. The labels for these examples are acquired and the examples are transferred to the training set. Typically, the utility of an example is determined by some measure of disagreement in the committee about its predicted label. This process is repeated until the number of available requests for labels is exhausted.
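A minimal sketch of the disagreement-based selection step is given below (scikit-learn ensembles, toy numeric arrays as inputs); in the actual pipeline this selection is repeated and a human annotator supplies the labels for the chosen sentences.

import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              AdaBoostClassifier, RandomForestClassifier)

def qbc_select(X_train, y_train, X_pool, n_queries=5):
    committee = [GradientBoostingClassifier(),
                 AdaBoostClassifier(),
                 RandomForestClassifier()]
    votes = []
    for clf in committee:
        clf.fit(X_train, y_train)
        votes.append(clf.predict(X_pool))
    votes = np.array(votes)                       # shape: (3, n_pool)

    # Disagreement: a committee of three disagrees when its votes are not all equal.
    disagreement = np.array([len(set(col)) for col in votes.T])
    ranked = np.argsort(-disagreement)            # most-disputed examples first
    return ranked[:n_queries]                     # indices to send to the annotator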
3.3 Ensemble Models for Classification of Business Events using the Bag-of-Words Approach

The series of classifiers trained on the dataset included the SVM, decision tree classifier, random forest classifier, AdaBoost classifier, gradient boosting classifier and SGD classifier. Among these, the boosting classifiers and the random forest classifier performed better than the others. We therefore used three ensemble classifiers with a decision tree as the base learner, namely the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. In the end, classification of the business event datasets was done by majority voting of these classifiers. The description and mathematical formulation of each ensemble classifier is given below.

3.3.1 Gradient Boosting Classifier

Boosting algorithms are a set of machine learning algorithms which build a strong classifier from a set of weak classifiers, typically decision trees. Gradient boosting is one such algorithm: it builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function. The differentiable loss function in our case is the binomial deviance loss function. The algorithm is implemented as described in (Friedman et al., 2001).

Input: a training set (X_i, y_i), i = 1, ..., n, with X_i ∈ H ⊆ R^n and y_i ∈ [−1, 1]; a differentiable loss function L(y, F(X)), which in our case is the binomial deviance loss log(1 + exp(−2yF(X))); and M, the number of iterations.

1. Initialize the model with a constant value: F_0(X) = argmin_γ Σ_{i=1}^{n} L(y_i, γ).

2. For m = 1 to M:

(a) Compute the pseudo-responses: r_im = −[∂L(y_i, F(X_i)) / ∂F(X_i)] evaluated at F(X) = F_{m−1}(X), for i = 1, ..., n.

(b) Fit a base learner h_m(X) to the pseudo-responses, i.e. train it on the set {(X_i, r_im)}_{i=1}^{n}.

(c) Compute the multiplier γ_m by solving the optimization problem γ_m = argmin_γ Σ_{i=1}^{n} L(y_i, F_{m−1}(X_i) + γ h_m(X_i)).

(d) Update the model: F_m(X) = F_{m−1}(X) + γ_m h_m(X).

3. Output F_M(X) = Σ_{m=1}^{M} γ_m h_m(X).

The value of the weight γ_m is found by an approximate Newton-Raphson step, given as

γ_m = Σ_{X_i ∈ h_m} r_im / Σ_{X_i ∈ h_m} |r_im| (2 − |r_im|)

3.3.2 AdaBoost Classifier

In AdaBoost we assign non-negative weights to the points in the data set, normalized so that they form a distribution. In each iteration we generate a training set by sampling from the data using the weights, i.e. the data point (X_i, y_i) is chosen with probability w_i, where w_i is the current weight for that data point; the training set is generated by such repeated independent sampling. After learning the current classifier, we increase the (relative) weights of the data points that it misclassifies, generate a fresh training set using the modified weights, and so on. The final classifier is essentially a weighted majority
voting by all the classifiers. The description of the algorithm as in (Freund et al., 1995) is given below.

Input: n examples (X_1, y_1), ..., (X_n, y_n), X_i ∈ H ⊆ R^n, y_i ∈ [−1, 1].

1. Initialize: w_i(1) = 1/n for all i; each data point starts with equal weight, so when data points are sampled from the probability distribution each one is equally likely to appear in the training set.

2. Assume there are M classifiers within the ensemble. For m = 1 to M do:

(a) Generate a training set by sampling with the weights w_i(m).

(b) Learn classifier h_m using this training set.

(c) Let ξ_m = Σ_{i=1}^{n} w_i(m) I[y_i ≠ h_m(X_i)], where I_A is the indicator function of A, defined as I_A = 1 if y_i ≠ h_m(X_i) and I_A = 0 if y_i = h_m(X_i); ξ_m is thus the error of the m-th classifier.

(d) Set α_m = log((1 − ξ_m) / ξ_m), the hypothesis weight; α_m > 0 under the assumption that ξ_m < 0.5.

(e) Update the weight distribution over the training set as w_i(m + 1) = w_i(m) exp(α_m I[y_i ≠ h_m(X_i)]), and normalize the updated weights so that w(m + 1) is a distribution: w_i(m + 1) = w_i(m + 1) / Σ_i w_i(m + 1).

end for

3. Output: the final vote h(X) = sgn(Σ_{m=1}^{M} α_m h_m(X)), the weighted vote of all the classifiers in the ensemble.

In the AdaBoost algorithm M is a parameter; owing to the sampling with weights, the procedure can be continued for an arbitrary number of iterations. The loss function used in the AdaBoost algorithm is the exponential loss, defined for a particular data point as exp(−y_i f(X_i)).
3.3.3 Random Forest Classifier

Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The main difference between standard decision trees and a random forest is that in a decision tree each node is split using the best split among all variables, whereas in a random forest each node is split using the best split among a subset of predictors randomly chosen at that node. In the random forest classifier, ntree bootstrap samples are drawn from the original data, and for each bootstrap sample an unpruned classification decision tree is grown, with the following modification: at each node, rather than choosing the best split among all predictors, mtry of the predictors are randomly sampled and the best split is chosen from among those variables. New data are predicted by aggregating the predictions of the ntree trees (i.e., majority vote for classification). The algorithm is described as follows, as in (Breiman, 2001).

Input: n examples D = (X_1, y_1), ..., (X_n, y_n), X_i ∈ R^n, where D is the whole dataset.

For i = 1, ..., B:

1. Choose a bootstrap sample D_i from D.

2. Construct a decision tree T_i from the bootstrap sample D_i such that at each node a random subset of m features is chosen and only splits on those features are considered.

Finally, given the test data X_t, take the majority vote of the trees for classification. Here B is the number of bootstrap data sets generated from the original data set D.
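The majority voting of the three ensembles described in sections 3.3.1-3.3.3 can be sketched as follows with scikit-learn; the corpus below is a placeholder for the labeled event sentences, and 100 estimators are used purely as an illustrative setting.

from sklearn.ensemble import (GradientBoostingClassifier, AdaBoostClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Placeholder corpus; in the real pipeline these are the labeled event sentences.
texts = ["IBM agreed to acquire Silverpop",
         "The weather in Colorado was pleasant",
         "Oracle signed a contract with a new supplier",
         "The museum opens at nine in the morning"] * 10
labels = [1, 0, 1, 0] * 10

X = CountVectorizer().fit_transform(texts)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

voter = VotingClassifier(
    estimators=[("gb", GradientBoostingClassifier(n_estimators=100)),
                ("ada", AdaBoostClassifier(n_estimators=100)),
                ("rf", RandomForestClassifier(n_estimators=100))],
    voting="hard")                      # majority vote of the three ensembles
voter.fit(X_tr, y_tr)
print("test accuracy:", voter.score(X_te, y_te))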
3.4 Multilayer Feed-Forward Network with Back-Propagation using the Word-Embedding Approach

In this approach a word-embedding framework was used to convert words to vectors, followed by an MFN to classify the business event dataset. The Gensim module in Python was used to build the word embeddings, training the words with the CBOW (continuous bag of words) or skip-gram model of the unsupervised neural language model (Tomas Mikolov et al., 2013), where each word is assigned a uniformly distributed (U[-1,1]) 100- to 300-dimensional vector. Once we have initialized vectors for each word using word embedding, a window-based approach converts the word vectors into a single global sentence vector. The obtained global sentence vector is fed into an MFN with back-propagation for classification of the sentences using a softmax classifier. The implementation of the algorithm is as follows:

1. Initialize each word in a sentence with a uniformly distributed (U[-1,1]) dense vector of 100 to 300 dimensions.

2. For the given set of words within a sentence, concatenate the word-embedding vectors to form a matrix for that particular sentence.

3. Choose an appropriate window size on the obtained matrix and apply max-pooling over each window, finally obtaining a global sentence vector.

4. Feed the obtained global sentence vectors into a multilayer feed-forward network with back-propagation using softmax as the loss function. For regularization of the multilayer feed-forward network and to avoid overfitting, the dropout mechanism is adopted.
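A minimal sketch of steps 1-3 (the sentence-vector construction) is given below using Gensim's Word2Vec and NumPy; the corpus, dimensionality, window size and the exact pooling scheme are illustrative assumptions (parameter names follow Gensim 4.x), and the downstream MFN is not shown.

import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus; the thesis corpus is the crawled business-news sentences.
corpus = [["ibm", "agreed", "to", "acquire", "silverpop"],
          ["tri-state", "signs", "agreement", "with", "nextera", "energy"],
          ["analysts", "said", "the", "appointment", "was", "a", "safe", "pick"]]

# CBOW (sg=0) embeddings, 100-dimensional, within the 100-300 range described above.
w2v = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)

def sentence_vector(tokens, model, window=3):
    # Stack the word vectors into a (number of words, dimension) matrix.
    mat = np.array([model.wv[t] for t in tokens if t in model.wv])
    # Max-pool over word windows, then average the pooled windows to obtain
    # one fixed-length global sentence vector.
    pools = [mat[i:i + window].max(axis=0) for i in range(0, len(mat), window)]
    return np.mean(pools, axis=0)

print(sentence_vector(corpus[0], w2v).shape)   # (100,)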
3.5 Convolutional Neural Networks for Sentence Classification with Unsupervised Feature Vector Learning

In this model a simple CNN is trained with one layer of convolution on top of word vectors obtained from an unsupervised neural language model (Yoon Kim, 2014). These vectors were trained by (Mikolov et al., 2013) on 100 billion words of Google News and are publicly available. Figure (3.1) describes the architecture of the CNN for sentence modeling.

Figure 3.1: Architecture of the convolutional neural network for sentence modelling (multichannel architecture).

Let N be the number of sentences in the vocabulary and n the number of words in a particular sentence, and let x_i ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as

x_{1:n} = x_1 ⊕ x_2 ⊕ ... ⊕ x_n

where ⊕ is the concatenation operator. In general, let x_{i:i+j} refer to the concatenation of the words x_i, x_{i+1}, ..., x_{i+j}. The weight vector w is initialized with
a random uniformly distributed matrix of size R^{h×k}. A convolution operation involves a filter weight matrix w which is applied to a window of h words of a particular sentence to produce a new feature. For example, a feature c_i is generated from a window of words x_{i:i+h−1} by

c_i = f(w · x_{i:i+h−1} + b)

Here b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence [x_{1:h}, x_{2:h+1}, ..., x_{n−h+1:n}] to produce a feature map

c = [c_1, c_2, ..., c_{n−h+1}], with c ∈ R^{n−h+1}.

We then apply a max-pooling operation over the feature map and take the maximum value c* = max{c} as the feature corresponding to this particular filter. The idea is to capture the most important feature, the one with the highest value, for each feature map. This pooling scheme naturally deals with variable sentence lengths.

We have described the process by which one feature is extracted from one filter. The model uses multiple filters (with varying window sizes) to obtain multiple features. These features are also called unsupervised features, because they are obtained by applying different randomly initialized filters with variable window sizes. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels. To avoid overfitting of the CNN models, the dropout mechanism is adopted.

3.5.1 Variations in CNN Sentence Models

CNN-rand: Our baseline model, where all words are randomly initialized and then modified during training.

CNN-static: A model with pre-trained vectors from word2vec. All words, including the unknown ones that are randomly initialized, are kept static and only the other parameters of the model are learned. Initializing word vectors with those
obtained from an unsupervised neural language model is a popular method to improve performance in the absence of a large supervised training set. We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have a dimensionality of 300 and were trained using the continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in the set of pre-trained words are initialized randomly.
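The following Keras sketch is one possible rendering of the single-convolution-layer architecture of section 3.5 (embedding, convolution over word windows, max-over-time pooling, dropout, softmax output); it is not the implementation used in the thesis, and the vocabulary size, sentence length and filter settings are placeholders.

from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMB_DIM = 20000, 50, 300   # placeholder hyper-parameters

model = models.Sequential([
    layers.Input(shape=(MAX_LEN,)),
    # Word indices -> k-dimensional word vectors (CNN-rand if trained from random
    # initialization, CNN-static if loaded from pre-trained word2vec and frozen).
    layers.Embedding(VOCAB_SIZE, EMB_DIM),
    # One convolution layer over windows of h = 3 words, 100 filters.
    layers.Conv1D(filters=100, kernel_size=3, activation="tanh"),
    # Max-over-time pooling: keep the largest value of each feature map.
    layers.GlobalMaxPooling1D(),
    # Dropout regularization before the softmax output layer.
    layers.Dropout(0.5),
    layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()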
Chapter 4
Results and Discussions

In this chapter we discuss the results obtained from the machine learning algorithms applied in our work:

1. the semi-supervised learning approach using naive Bayes with expectation-maximization, and active learning with QBC, used to increase the number of labeled data points;

2. the ensemble classifiers, MFN and CNN models used to classify the obtained business data.

Described below are the results and analysis of these algorithms.

4.1 Semi-supervised Learning Implementation using Naive Bayes with Expectation Maximization

Initially we had few data points, labeled in a supervised manner. To formulate and solve the problem as a business event classification problem, our primary objective was to increase the number of labeled data points. In accordance with the algorithm of semi-supervised learning using a naive Bayes classifier with expectation maximization explained in Section 3.1, the following are the results in the three domains of acquisition, vendor-supplier and job events, with the training data taken as 30%, 40% and 50% of the whole dataset and the rest of the pool as test data.
4.1.1 Results and Analysis of Vendor-Supplier Event Data

754 vendor-supplier data points were labeled in a supervised manner. Stated below are some of the observations made for a large pool of unlabeled test data, obtained by varying the split between test data and training data. Table (4.1) and figure (4.1) show the variation in accuracy and F-score for 30%, 40% and 50% of the data used as training data and the corresponding remaining part as test data. Figures (4.2), (4.4) and (4.6) display the confusion matrices for 30%, 40% and 50% training data, and figures (4.3), (4.5) and (4.7) the corresponding ROC curves. The confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives.

Analysis: We observe an increase in accuracy and F-score as the number of training data points increases, which is as expected. However, the increase in accuracy is larger than the increase in F-score, because true negatives outnumber true positives. The confusion matrix plots show slight variations in the numbers of true positives and true negatives as the number of training data points is increased. The ROC curves show an increase in TPR and in the area under the ROC curve as the number of training data points increases.
Table 4.1: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for vendor-supplier data

Training data (%)   Accuracy   F-score   Description of dataset
30                  0.5597     0.5915    testing data = 527, training data = 227
40                  0.7434     0.65      testing data = 454, training data = 300
50                  0.7765     0.674     testing data = 376, training data = 376

Figure 4.1: Variations in accuracies and F1-scores for vendor-supplier data using the naive Bayes semi-supervised technique
Figure 4.2: Confusion matrix for a large pool of testing data (70 percent) and training data (30 percent) for VNSP

Figure 4.3: ROC curve for a large pool of testing data (70 percent) and training data (30 percent) for VNSP
Figure 4.4: Confusion matrix for a large pool of testing data (60 percent) and training data (40 percent) for VNSP

Figure 4.5: ROC curve for a large pool of testing data (60 percent) and training data (40 percent) for VNSP
Figure 4.6: Confusion matrix for a large pool of testing data (50 percent) and training data (50 percent) for VNSP

Figure 4.7: ROC curve for a large pool of testing data (50 percent) and training data (50 percent) for VNSP
4.1.2 Results and Analysis for Job Event Data

2810 job event data points were labeled in a supervised manner. Stated below are some of the observations made for a large pool of unlabeled test data, obtained by varying the split between test data and training data. Table (4.2) and figure (4.8) show the variation in accuracy and F-score for 30%, 40% and 50% of the data used as training data and the corresponding remaining part as test data. Figures (4.9), (4.11) and (4.13) display the confusion matrices for 30%, 40% and 50% training data, and figures (4.10), (4.12) and (4.14) the corresponding ROC curves. The confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives.

Analysis: As the number of training data points increases, we observe an increase in accuracy and F-score. However, there is a large gap between the accuracy values and the F-scores, because the number of true negatives is very high in comparison to the number of true positives, which is clearly visible in the confusion matrix plots. The ROC curves show an increase in TPR and in the area under the ROC curve as the number of training data points increases.
Table 4.2: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Job event data

Training data (%)   Accuracy   F-score   Description of dataset
30                  0.7483     0.4444    testing data = 1967, training data = 842
40                  0.7544     0.4863    testing data = 1686, training data = 1123
50                  0.8014     0.52      testing data = 1405, training data = 1404

Figure 4.8: Variations in accuracies and F1-scores for Job event data using the naive Bayes semi-supervised technique
Figure 4.9: Confusion matrix for a large pool of testing data (70 percent) and training data (30 percent) for JOB

Figure 4.10: ROC curve for a large pool of testing data (70 percent) and training data (30 percent) for JOB
Figure 4.11: Confusion matrix for a large pool of testing data (60 percent) and training data (40 percent) for JOB

Figure 4.12: ROC curve for a large pool of testing data (60 percent) and training data (40 percent) for JOB
Figure 4.13: Confusion matrix for a large pool of testing data (50 percent) and training data (50 percent) for JOB

Figure 4.14: ROC curve for a large pool of testing data (50 percent) and training data (50 percent) for JOB
4.1.3 Results and Analysis for Acquisition Event Data

1380 acquisition event data points were labeled in a supervised manner. Stated below are some of the observations made for a large pool of unlabeled test data, obtained by varying the split between test data and training data. Table (4.3) and figure (4.15) show the variation in accuracy and F-score for 30%, 40% and 50% of the data used as training data and the corresponding remaining part as test data. Figures (4.16), (4.18) and (4.20) display the confusion matrices for 30%, 40% and 50% training data, and figures (4.17), (4.19) and (4.21) the corresponding ROC curves. The confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives.

Analysis: Accuracy and F-score increase as the number of training data points increases. The increase in F-score is slightly higher than the increase in accuracy, because true positives outnumber true negatives; as a result the classifier is biased towards the positive class, and the number of false positives is higher in this scenario, which is clearly visible from the confusion matrix plots. The ROC curves show an increase in TPR and in the area under the ROC curve as the number of training data points increases.
Table 4.3: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Acquisition event data

Training data (%) | Accuracy | F-score | Description of data
30 | 0.7929 | 0.8178 | Testing data = 966, training data = 413
40 | 0.7989 | 0.82 | Testing data = 828, training data = 521
50 | 0.8057 | 0.8241 | Testing data = 689, training data = 690

Figure 4.15: Variation in accuracies and F1-scores for Acquisition event data using the naive Bayes semi-supervised technique
Figure 4.16: Confusion matrix for the large pool of testing data (70 percent) and training data (30 percent) for Acquisition

Figure 4.17: ROC curve for the large pool of testing data (70 percent) and training data (30 percent) for Acquisition
Figure 4.18: Confusion matrix for the large pool of testing data (60 percent) and training data (40 percent) for Acquisition

Figure 4.19: ROC curve for the large pool of testing data (60 percent) and training data (40 percent) for Acquisition
Figure 4.20: Confusion matrix for the large pool of testing data (50 percent) and training data (50 percent) for Acquisition

Figure 4.21: ROC curve for the large pool of testing data (50 percent) and training data (50 percent) for Acquisition
4.2 Active Learning Implementation by the Query-by-Committee Approach

In accordance with the active learning algorithm explained in section 3.1.2, the following are the results in the three domains of acquisition, vendor-supplier and job events, with the training data taken as 30%, 40% and 50% of the whole dataset. The test data are predicted by majority voting of three ensemble classifiers: a gradient boosting classifier, an AdaBoost classifier and a random forest classifier (i.e. the query-by-committee approach).

4.2.1 Results and Analysis for Vendor-Supplier Event Data

The number of Vendor-supplier data points labeled in a supervised manner was 754. The following are some of the observations made on the large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.4) and figure (4.22) show the variation in accuracies and F-scores for 30%, 40% and 50% of the data used as training data, with the corresponding remaining part used as test data. Figures (4.23), (4.25) and (4.27) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.24), (4.26) and (4.28) display the corresponding ROC curves.

Analysis: Accuracy and F-score both increase as the number of training data points increases, but the accuracies increase by larger amounts than the F-scores because the number of true negatives is higher than the number of true positives. This method performs better than the semi-supervised naive Bayes classifier. The confusion matrix plots show only slight variation in the numbers of true positives and true negatives as the number of training data points increases. The ROC curves show an increase in TPR and in the area under the ROC curve as the number of training data points increases.
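To make the query-by-committee setup concrete, below is a minimal sketch of how a committee of the three ensemble classifiers can label examples from the unlabeled pool by majority vote. The class names are scikit-learn's, but the function and variable names, and the use of plain hard voting, are illustrative assumptions rather than the project's exact implementation.

```python
# Illustrative sketch (assumed): a committee of three ensembles labels the
# unlabeled pool by majority vote, as in the query-by-committee approach.
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              AdaBoostClassifier, RandomForestClassifier)

def committee_predict(X_train, y_train, X_unlabeled):
    committee = [GradientBoostingClassifier(),
                 AdaBoostClassifier(),
                 RandomForestClassifier()]
    votes = []
    for clf in committee:
        clf.fit(X_train, y_train)              # train each committee member
        votes.append(clf.predict(X_unlabeled)) # its vote on every unlabeled point
    votes = np.array(votes)                    # shape: (3, n_unlabeled)
    # majority vote across the three members (binary 0/1 labels assumed)
    return (votes.sum(axis=0) >= 2).astype(int)
```

In an active-learning loop, the points on which the committee members disagree would be the ones queried for manual labels, after which the committee is retrained on the enlarged labeled set and the process repeats.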
Table 4.4: Variation in accuracies and F-scores using Active Learning (QBC approach) for Vendor-supplier event data

Training data (%) | Accuracy | F-score | Description of data
30 | 0.842 | 0.7348 | Testing data = 529, training data = 225
40 | 0.84 | 0.7352 | Testing data = 454, training data = 300
50 | 0.8643 | 0.76 | Testing data = 376, training data = 376

Figure 4.22: Variation in accuracies and F1-scores for Vendor-supplier data using Active Learning
Figure 4.23: Confusion matrix for the large pool of testing data (70 percent) and training data (30 percent) for Vendor-supplier

Figure 4.24: ROC curve for the large pool of testing data (70 percent) and training data (30 percent) for Vendor-supplier
Figure 4.25: Confusion matrix for the large pool of testing data (60 percent) and training data (40 percent) for Vendor-supplier

Figure 4.26: ROC curve for the large pool of testing data (60 percent) and training data (40 percent) for Vendor-supplier
Figure 4.27: Confusion matrix for the large pool of testing data (50 percent) and training data (50 percent) for Vendor-supplier

Figure 4.28: ROC curve for the large pool of testing data (50 percent) and training data (50 percent) for Vendor-supplier
4.2.2 Results and Analysis for Job Event Data

The number of Job event data points labeled in a supervised manner was 2809. The following are some of the observations made on the large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.5) and figure (4.29) show the variation in accuracies and F-scores for 30%, 40% and 50% of the data used as training data, with the corresponding remaining part used as test data. Figures (4.30), (4.32) and (4.34) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.31), (4.33) and (4.35) display the corresponding ROC curves.

Analysis: As the number of training data points increases, both accuracy and F-score increase. However, there is a large gap between the accuracies and the F-scores because the number of true negatives is far higher than the number of true positives, which is clearly visible in the confusion matrix plots. The ROC curves show an increase in TPR and in the area under the ROC curve as the number of training data points increases. The performance of this method is better than that of the semi-supervised naive Bayes classifier, as is clearly visible from the results.
Table 4.5: Variation in accuracies and F-scores using Active Learning (QBC approach) for Job event data

Training data (%) | Accuracy | F-score | Description of data
30 | 0.9054 | 0.6204 | Testing data = 1967, training data = 842
40 | 0.9116 | 0.6558 | Testing data = 1686, training data = 1123
50 | 0.9216 | 0.6758 | Testing data = 1405, training data = 1404

Figure 4.29: Variation in accuracies and F1-scores for Job event data using Active Learning
Figure 4.30: Confusion matrix for the large pool of testing data (70 percent) and training data (30 percent) for Job

Figure 4.31: ROC curve for the large pool of testing data (70 percent) and training data (30 percent) for Job
Figure 4.32: Confusion matrix for the large pool of testing data (60 percent) and training data (40 percent) for Job

Figure 4.33: ROC curve for the large pool of testing data (60 percent) and training data (40 percent) for Job
Figure 4.34: Confusion matrix for the large pool of testing data (50 percent) and training data (50 percent) for Job

Figure 4.35: ROC curve for the large pool of testing data (50 percent) and training data (50 percent) for Job
4.2.3 Results and Analysis for Acquisition Event Data

The number of Acquisition event data points labeled in a supervised manner was 1380. The following are some of the observations made on the large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.6) and figure (4.36) show the variation in accuracies and F-scores for 30%, 40% and 50% of the data used as training data, with the corresponding remaining part used as test data. Figures (4.37), (4.39) and (4.41) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.38), (4.40) and (4.42) display the corresponding ROC curves.

Analysis: Accuracy and F-score both increase as the number of training data points increases, and the increases in F-score are comparable to the increases in accuracy. The confusion matrix plots show that the numbers of true positives and true negatives are nearly equal. The ROC curves show an increase in TPR and in the area under the ROC curve as the number of training data points increases. This method shows a slight improvement in accuracy over the semi-supervised naive Bayes classifier.
Table 4.6: Variation in accuracies and F-scores using Active Learning (QBC approach) for Acquisition event data

Training data (%) | Accuracy | F-score | Description of data
30 | 0.7855 | 0.7549 | Testing data = 966, training data = 413
40 | 0.812 | 0.7867 | Testing data = 828, training data = 521
50 | 0.82 | 0.7995 | Testing data = 689, training data = 690

Figure 4.36: Variation in accuracies and F1-scores for Acquisition event data using Active Learning
Figure 4.37: Confusion matrix for the large pool of testing data (70 percent) and training data (30 percent) for Acquisition

Figure 4.38: ROC curve for the large pool of testing data (70 percent) and training data (30 percent) for Acquisition
Figure 4.39: Confusion matrix for the large pool of testing data (60 percent) and training data (40 percent) for Acquisition

Figure 4.40: ROC curve for the large pool of testing data (60 percent) and training data (40 percent) for Acquisition
Figure 4.41: Confusion matrix for the large pool of testing data (50 percent) and training data (50 percent) for Acquisition

Figure 4.42: ROC curve for the large pool of testing data (50 percent) and training data (50 percent) for Acquisition

4.3 Comparison of Semi-supervised Techniques and the Active Learning Approach

Active learning using the query-by-committee approach, in which the committee consists of three ensemble classifiers (i.e. gradient boosting, AdaBoost and
random forest classifiers), performs better than the semi-supervised naive Bayes probabilistic model in terms of accuracy for all three domains (acquisition, vendor-supplier and job events) and for all the corresponding variations of the training data (30%, 40% and 50%). Active learning using the query-by-committee approach was implemented in our work to increase the number of labeled data points drawn from the pool of unlabeled data.

4.4 Results of Ensemble Classifiers with Different Parameter Tuning

The parameter to be tuned was the number of base learners used within each ensemble classifier; the base learners in our case were decision trees. The following are observations with 100 and 500 base learners within each ensemble, which were varied as part of parameter tuning. The ensemble classifiers used in the implementation were gradient boosting, AdaBoost and random forest classifiers. Finally, majority voting of these three ensemble classifiers was performed to predict the test data, in order to increase the test accuracy.

4.4.1 Analysis for Vendor-supplier Event Data Using 100 Estimators within the Ensemble as the Parameter

The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training and 20% for testing. Five-fold cross-validation was used for tuning the parameter. Figure (4.43) and tables (4.7), (4.8) and (4.9) display the variation in the 5-fold training scores, accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.
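A minimal sketch of this tuning step is shown below, assuming a scikit-learn workflow: the number of decision-tree base learners (n_estimators) is varied and each ensemble is scored with 5-fold cross-validation on accuracy and F-score. The function and variable names are illustrative, not the project's exact code.

```python
# Illustrative sketch (assumed): tune the number of base learners by
# 5-fold cross-validation on accuracy and F-score for each ensemble.
from sklearn.ensemble import (GradientBoostingClassifier,
                              AdaBoostClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_validate

def tune_n_estimators(X_train, y_train, candidates=(100, 500)):
    results = {}
    for n in candidates:
        for name, clf in [("gboost", GradientBoostingClassifier(n_estimators=n)),
                          ("adaboost", AdaBoostClassifier(n_estimators=n)),
                          ("rforest", RandomForestClassifier(n_estimators=n))]:
            scores = cross_validate(clf, X_train, y_train, cv=5,
                                    scoring=("accuracy", "f1"))
            # per-fold accuracy and F-score, as reported in Tables 4.7-4.29
            results[(name, n)] = (scores["test_accuracy"], scores["test_f1"])
    return results
```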
Table 4.7: Variation in accuracies and F-scores for the Gradient Boosting classifier with the number of estimators set to 100, on the Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.9084 | 0.8248
2 | 0.8936 | 0.8119
3 | 0.8921 | 0.8033
4 | 0.8921 | 0.7906
5 | 0.90384 | 0.8181

Table 4.8: Variation in accuracies and F-scores for the AdaBoost classifier with the number of estimators set to 100, on the Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.8862 | 0.7979
2 | 0.8833 | 0.7937
3 | 0.8788 | 0.7743
4 | 0.8788 | 0.7783
5 | 0.8994 | 0.8055

Table 4.9: Variation in accuracies and F-scores for the Random Forest classifier with the number of estimators set to 100, on the Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.9143 | 0.8309
2 | 0.9054 | 0.8192
3 | 0.8995 | 0.8045
4 | 0.9158 | 0.8209
5 | 0.9082 | 0.8152
Figure 4.43: Variation in accuracies and F1-scores for Vendor-supplier data over 5 folds using the 3 ensemble classifiers

The following table (4.10) gives the test score for the Vendor-supplier data with 100 estimators, obtained by majority voting of the three ensemble classifiers (gradient boosting, AdaBoost and random forest). Figure (4.44) shows the confusion matrix for the test data, which gives an understanding of the true positives, true negatives, false positives and false negatives, and figure (4.45) displays the ROC curve.

Table 4.10: Test score for accuracy and F-score on the Vendor-supplier data using voting of the three ensemble classifiers with the number of estimators set to 100

Area under ROC | Accuracy | F-score | Confusion matrix values
87% | 90.9% | 83.511% | true positives = 195, false positives = 16, true negatives = 575, false negatives = 61
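The majority-voting step that produces these test scores can be sketched as below, assuming scikit-learn's VotingClassifier with hard voting; the choice of VotingClassifier (rather than hand-rolled voting) and all names are illustrative assumptions rather than the project's exact code.

```python
# Illustrative sketch (assumed): hard majority voting of the three tuned
# ensembles to predict the held-out 20% test split of bag-of-words features.
from sklearn.ensemble import (GradientBoostingClassifier, AdaBoostClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.metrics import accuracy_score, f1_score

def vote_and_score(X_train, y_train, X_test, y_test, n_estimators=100):
    voter = VotingClassifier(
        estimators=[("gboost", GradientBoostingClassifier(n_estimators=n_estimators)),
                    ("adaboost", AdaBoostClassifier(n_estimators=n_estimators)),
                    ("rforest", RandomForestClassifier(n_estimators=n_estimators))],
        voting="hard")                     # majority vote of the three members
    voter.fit(X_train, y_train)
    y_pred = voter.predict(X_test)
    return accuracy_score(y_test, y_pred), f1_score(y_test, y_pred)
```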
Figure 4.44: Confusion matrix for Vendor-supplier with the number of estimators as 100

Figure 4.45: ROC curve for Vendor-supplier with the number of estimators as 100

4.4.2 Analysis for Job Event Data Using 100 Estimators within the Ensemble as the Parameter

The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training
and 20% for testing. Five-fold cross-validation was used for tuning the parameter. Figure (4.46) and tables (4.11), (4.12) and (4.13) display the variation in the 5-fold training scores, accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.11: Variation in accuracies and F-scores for the Gradient Boosting classifier with the number of estimators set to 100, on the Job data set

Fold | Accuracy | F-score
1 | 0.8670 | 0.7593
2 | 0.8961 | 0.7710
3 | 0.8743 | 0.77
4 | 0.8797 | 0.75
5 | 0.8870 | 0.7686

Table 4.12: Variation in accuracies and F-scores for the AdaBoost classifier with the number of estimators set to 100, on the Job data set

Fold | Accuracy | F-score
1 | 0.8561 | 0.7748
2 | 0.8907 | 0.7744
3 | 0.8761 | 0.7671
4 | 0.8688 | 0.7592
5 | 0.8925 | 0.7958

Table 4.13: Variation in accuracies and F-scores for the Random Forest classifier with the number of estimators set to 100, on the Job data set

Fold | Accuracy | F-score
1 | 0.8888 | 0.7664
2 | 0.8816 | 0.7530
3 | 0.8943 | 0.7870
4 | 0.9052 | 0.7846
5 | 0.8998 | 0.7971
Figure 4.46: Variation in accuracies and F1-scores for Job data over 5 folds using the 3 ensemble classifiers

The following table (4.14) gives the test score for the Job data with 100 estimators, obtained by majority voting of the three ensemble classifiers (gradient boosting, AdaBoost and random forest). Figure (4.47) shows the confusion matrix for the test data, which gives an understanding of the true positives, true negatives, false positives and false negatives, and figure (4.48) displays the ROC curve.

Table 4.14: Test score for accuracy and F-score on the Job data using voting of the three ensemble classifiers with the number of estimators set to 100

Area under ROC | Accuracy | F-score | Confusion matrix values
83% | 90.3% | 81.97% | true positives = 141, false positives = 15, true negatives = 484, false negatives = 47
Figure 4.47: Confusion matrix for Job with the number of estimators as 100

Figure 4.48: ROC curve for Job with the number of estimators as 100

4.4.3 Analysis for Acquisition Event Data Using 100 Estimators within the Ensemble as the Parameter

The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training and 20% for testing. Five-fold cross-validation was used for tuning the
parameter. Figure (4.49) and tables (4.15), (4.16) and (4.17) display the variation in the 5-fold training scores, accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.15: Variation in accuracies and F-scores for the Gradient Boosting classifier with the number of estimators set to 100, on the Acquisition data set

Fold | Accuracy | F-score
1 | 0.9173 | 0.8699
2 | 0.9230 | 0.8605
3 | 0.9301 | 0.8657
4 | 0.9031 | 0.8416
5 | 0.9230 | 0.8606

Table 4.16: Variation in accuracies and F-scores for the Random Forest classifier with the number of estimators set to 100, on the Acquisition data set

Fold | Accuracy | F-score
1 | 0.9287 | 0.9026
2 | 0.9287 | 0.8803
3 | 0.9472 | 0.9226
4 | 0.9273 | 0.8939
5 | 0.9458 | 0.9037

Table 4.17: Variation in accuracies and F-scores for the AdaBoost classifier with the number of estimators set to 100, on the Acquisition data set

Fold | Accuracy | F-score
1 | 0.9330 | 0.8941
2 | 0.9245 | 0.8798
3 | 0.9430 | 0.8924
4 | 0.9245 | 0.8819
5 | 0.9344 | 0.8931
Figure 4.49: Variation in accuracies and F1-scores for Acquisition data over 5 folds using the 3 ensemble classifiers

The following table (4.18) gives the test score for the Acquisition data with 100 estimators, obtained by majority voting of the three ensemble classifiers (gradient boosting, AdaBoost and random forest). Figure (4.50) shows the confusion matrix for the test data, which gives an understanding of the true positives, true negatives, false positives and false negatives, and figure (4.51) displays the ROC curve.

Table 4.18: Test score for accuracy and F-score on the Acquisition data using voting of the three ensemble classifiers with the number of estimators set to 100

Area under ROC | Accuracy | F-score | Confusion matrix values
89% | 92.25% | 87.022% | true positives = 228, false positives = 17, true negatives = 582, false negatives = 51
Figure 4.50: Confusion matrix for Acquisition with the number of estimators as 100

Figure 4.51: ROC curve for Acquisition with the number of estimators as 100

4.4.4 Analysis for Vendor-supplier Event Data Using 500 Estimators within the Ensemble as the Parameter

The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training
and 20% for testing. Five-fold cross-validation was used for tuning the parameter. Figure (4.52) and tables (4.19), (4.20) and (4.21) display the variation in the 5-fold training scores, accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.19: Variation in accuracies and F-scores for the Gradient Boosting classifier with the number of estimators set to 500, on the Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.8892 | 0.7761
2 | 0.8936 | 0.7906
3 | 0.8995 | 0.7975
4 | 0.9054 | 0.8083
5 | 0.8934 | 0.8091

Table 4.20: Variation in accuracies and F-scores for the AdaBoost classifier with the number of estimators set to 500, on the Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.9010 | 0.8112
2 | 0.8862 | 0.7924
3 | 0.9054 | 0.8202
4 | 0.9054 | 0.8181
5 | 0.8846 | 0.8088

Table 4.21: Variation in accuracies and F-scores for the Random Forest classifier with the number of estimators set to 500, on the Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.8966 | 0.7818
2 | 0.9054 | 0.8235
3 | 0.9098 | 0.8085
4 | 0.9054 | 0.8096
5 | 0.9038 | 0.8380
Figure 4.52: Variation in accuracies and F1-scores for Vendor-supplier data over 5 folds using the 3 ensemble classifiers

The following table (4.22) gives the test score for the Vendor-supplier data with 500 estimators, obtained by majority voting of the three ensemble classifiers (gradient boosting, AdaBoost and random forest). Figure (4.53) shows the confusion matrix for the test data, which gives an understanding of the true positives, true negatives, false positives and false negatives, and figure (4.54) displays the ROC curve.

Table 4.22: Test score for accuracy and F-score on the Vendor-supplier data using voting of the three ensemble classifiers with the number of estimators set to 500

Area under ROC | Accuracy | F-score | Confusion matrix values
88% | 91.97% | 85.211% | true positives = 196, false positives = 16, true negatives = 583, false negatives = 52
Figure 4.53: Confusion matrix for Vendor-supplier with the number of estimators as 500

Figure 4.54: ROC curve for Vendor-supplier with the number of estimators as 500

4.4.5 Analysis for Job Event Data Using 500 Estimators within the Ensemble as the Parameter

The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training
and 20% for testing. Five-fold cross-validation was used for tuning the parameter. Figure (4.55) and tables (4.23), (4.24) and (4.25) display the variation in the 5-fold training scores, accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.23: Variation in accuracies and F-scores for the Gradient Boosting classifier with the number of estimators set to 500, on the Job data set

Fold | Accuracy | F-score
1 | 0.8943 | 0.7944
2 | 0.9107 | 0.8297
3 | 0.9016 | 0.7812
4 | 0.9107 | 0.8167
5 | 0.9034 | 0.81

Table 4.24: Variation in accuracies and F-scores for the AdaBoost classifier with the number of estimators set to 500, on the Job data set

Fold | Accuracy | F-score
1 | 0.8834 | 0.7894
2 | 0.8888 | 0.7859
3 | 0.9016 | 0.8029
4 | 0.8870 | 0.7769
5 | 0.8816 | 0.7653

Table 4.25: Variation in accuracies and F-scores for the Random Forest classifier with the number of estimators set to 500, on the Job data set

Fold | Accuracy | F-score
1 | 0.8761 | 0.7585
2 | 0.9016 | 0.7806
3 | 0.8888 | 0.7698
4 | 0.8888 | 0.7630
5 | 0.9016 | 0.7704
Figure 4.55: Variation in accuracies and F1-scores for Job data over 5 folds using the 3 ensemble classifiers

The following table (4.26) gives the test score for the Job data with 500 estimators, obtained by majority voting of the three ensemble classifiers (gradient boosting, AdaBoost and random forest). Figure (4.56) shows the confusion matrix for the test data, which gives an understanding of the true positives, true negatives, false positives and false negatives, and figure (4.57) displays the ROC curve.

Table 4.26: Test score for accuracy and F-score on the Job data using voting of the three ensemble classifiers with the number of estimators set to 500

Area under ROC | Accuracy | F-score | Confusion matrix values
87.56% | 92.3% | 83.88% | true positives = 149, false positives = 16, true negatives = 486, false negatives = 36
Figure 4.56: Confusion matrix for Job with the number of estimators as 500

Figure 4.57: ROC curve for Job with the number of estimators as 500

4.4.6 Analysis for Acquisition Event Data Using 500 Estimators within the Ensemble as the Parameter

The number of data points obtained after implementation of the active learning approach was around 4500. The following observations were obtained with the three ensemble classifiers using the bag-of-words approach, with 80% of the data used for training and 20% for testing. Five-fold cross-validation was used for tuning the
parameter. Figure (4.58) and tables (4.27), (4.28) and (4.29) display the variation in the 5-fold training scores, accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.27: Variation in accuracies and F-scores for the Gradient Boosting classifier with the number of estimators set to 500, on the Acquisition data set

Fold | Accuracy | F-score
1 | 0.9230 | 0.8693
2 | 0.9245 | 0.8564
3 | 0.9259 | 0.8735
4 | 0.9131 | 0.86238
5 | 0.9216 | 0.8648

Table 4.28: Variation in accuracies and F-scores for the Random Forest classifier with the number of estimators set to 500, on the Acquisition data set

Fold | Accuracy | F-score
1 | 0.9458 | 0.9121
2 | 0.9373 | 0.8976
3 | 0.9444 | 0.9107
4 | 0.9188 | 0.8758
5 | 0.9415 | 0.9061

Table 4.29: Variation in accuracies and F-scores for the AdaBoost classifier with the number of estimators set to 500, on the Acquisition data set

Fold | Accuracy | F-score
1 | 0.9358 | 0.8872
2 | 0.9316 | 0.8953
3 | 0.9358 | 0.8878
4 | 0.9230 | 0.8717
5 | 0.9387 | 0.8997
Figure 4.58: Variation in accuracies and F1-scores for Acquisition data over 5 folds using the 3 ensemble classifiers

The following table (4.30) gives the test score for the Acquisition data with 500 estimators, obtained by majority voting of the three ensemble classifiers (gradient boosting, AdaBoost and random forest). Figure (4.59) shows the confusion matrix for the test data, which gives an understanding of the true positives, true negatives, false positives and false negatives, and figure (4.60) displays the ROC curve.

Table 4.30: Test score for accuracy and F-score on the Acquisition data using voting of the three ensemble classifiers with the number of estimators set to 500

Area under ROC | Accuracy | F-score | Confusion matrix values
92% | 94.21% | 91.10% | true positives = 245, false positives = 8, true negatives = 591, false negatives = 34
Figure 4.59: Confusion matrix for Acquisition with the number of estimators as 500

Figure 4.60: ROC curve for Acquisition with the number of estimators as 500

4.5 Final Accuracy and F-score Estimates for the Model

Once the parameters were tuned, a test analysis of the whole data set was performed. In this process we train the model on the whole dataset and check the overall
accuracies and F-scores using 5-fold cross-validation. The overall analysis is discussed in the subsequent subsections.

4.5.1 Final Analysis of the Vendor-supplier Dataset

The Vendor-supplier dataset consists of around 4500 data points. The parameter, i.e. the number of estimators, was tuned to a value of 500, which gave the best performance under cross-validation. The following are the results obtained with this tuned parameter value on the whole data set. Figure (4.61) and tables (4.31), (4.32) and (4.33) display the variation in the 5-fold training scores, whole-data-set accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.31: Variation in accuracies and F-scores for the Gradient Boosting classifier on the whole Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.9067 | 0.83511
2 | 0.8912 | 0.7925
3 | 0.8947 | 0.8044
4 | 0.9172 | 0.8529
5 | 0.9219 | 0.8538

Table 4.32: Variation in accuracies and F-scores for the AdaBoost classifier on the whole Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.9020 | 0.8329
2 | 0.8947 | 0.8026
3 | 0.8841 | 0.7958
4 | 0.9054 | 0.8336
5 | 0.8983 | 0.8123
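This whole-data-set evaluation can be sketched as below, again assuming a scikit-learn workflow in which each tuned ensemble (500 estimators) is scored with 5-fold cross-validation over the full feature matrix; all names are illustrative, not the project's exact code.

```python
# Illustrative sketch (assumed): score each tuned ensemble (500 estimators)
# on the whole dataset with 5-fold cross-validation, as in Tables 4.31-4.39.
from sklearn.ensemble import (GradientBoostingClassifier,
                              AdaBoostClassifier, RandomForestClassifier)
from sklearn.model_selection import cross_validate

def final_scores(X_all, y_all):
    ensembles = {"gboost": GradientBoostingClassifier(n_estimators=500),
                 "adaboost": AdaBoostClassifier(n_estimators=500),
                 "rforest": RandomForestClassifier(n_estimators=500)}
    report = {}
    for name, clf in ensembles.items():
        scores = cross_validate(clf, X_all, y_all, cv=5,
                                scoring=("accuracy", "f1"))
        report[name] = {"accuracy": scores["test_accuracy"],  # per-fold values
                        "f1": scores["test_f1"]}
    return report
```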
Table 4.33: Variation in accuracies and F-scores for the Random Forest classifier on the whole Vendor-supplier data set

Fold | Accuracy | F-score
1 | 0.9173 | 0.8395
2 | 0.9018 | 0.8019
3 | 0.9030 | 0.8137
4 | 0.9148 | 0.8509
5 | 0.9137 | 0.8212

Figure 4.61: Variation in accuracies and F1-scores for Vendor-supplier data over the whole data set

4.5.2 Final Analysis of the Job Dataset

The Job dataset consists of around 4500 data points. The parameter, i.e. the number of estimators, was tuned to a value of 500, which gave the best performance
under cross-validation. The following are the results obtained with this tuned parameter value on the whole data set. Figure (4.62) and tables (4.34), (4.35) and (4.36) display the variation in the 5-fold training scores, whole-data-set accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.34: Variation in accuracies and F-scores for the Gradient Boosting classifier on the whole Job data set

Fold | Accuracy | F-score
1 | 0.9053 | 0.8134
2 | 0.9010 | 0.7841
3 | 0.9125 | 0.8125
4 | 0.9139 | 0.8478
5 | 0.9154 | 0.8429

Table 4.35: Variation in accuracies and F-scores for the AdaBoost classifier on the whole Job data set

Fold | Accuracy | F-score
1 | 0.9112 | 0.8235
2 | 0.9010 | 0.7952
3 | 0.8848 | 0.7830
4 | 0.9037 | 0.8272
5 | 0.9023 | 0.8152
Table 4.36: Variation in accuracies and F-scores for the Random Forest classifier on the whole Job data set

Fold | Accuracy | F-score
1 | 0.9112 | 0.8051
2 | 0.8981 | 0.8
3 | 0.8965 | 0.7801
4 | 0.9139 | 0.8209
5 | 0.8921 | 0.81

Figure 4.62: Variation in accuracies and F1-scores for Job data over the whole data set

4.5.3 Final Analysis of the Acquisition Dataset

The Acquisition dataset consists of around 4500 data points. The parameter, i.e. the number of estimators, was tuned to a value of 500, which gave the best
performance under cross-validation. The following are the results obtained with this tuned parameter value on the whole data set. Figure (4.63) and tables (4.37), (4.38) and (4.39) display the variation in the 5-fold training scores, whole-data-set accuracies and F-scores for the gradient boosting classifier, the AdaBoost classifier and the random forest classifier.

Table 4.37: Variation in accuracies and F-scores for the Gradient Boosting classifier on the whole Acquisition data set

Fold | Accuracy | F-score
1 | 0.9202 | 0.8689
2 | 0.9373 | 0.8924
3 | 0.9396 | 0.8846
4 | 0.9429 | 0.9084
5 | 0.9293 | 0.8872

Table 4.38: Variation in accuracies and F-scores for the Random Forest classifier on the whole Acquisition data set

Fold | Accuracy | F-score
1 | 0.9328 | 0.8851
2 | 0.9544 | 0.9161
3 | 0.9487 | 0.9150
4 | 0.9452 | 0.9100
5 | 0.9441 | 0.9039
Table 4.39: Variation in accuracies and F-scores for the AdaBoost classifier on the whole Acquisition data set

Fold | Accuracy | F-score
1 | 0.9362 | 0.8992
2 | 0.9464 | 0.9101
3 | 0.9362 | 0.8943
4 | 0.9407 | 0.9057
5 | 0.9395 | 0.9016

Figure 4.63: Variations in accuracies and F1-scores for Acquisition data over 5 folds for the whole data set
4.6 Results Obtained for MFN with Word Embedding

The results obtained with this model were not satisfactory, and the classifier predictions for this model were not accurate. The global sentence vector was generated, based on code logic, from the set of word-embedding vectors. The following is an illustration of the Vendor-supplier test scores on a test data set of 225 data points. Similar failure results were obtained for the Job and Acquisition events.

Table 4.40: Test score for MFN with word embedding on the Vendor-supplier dataset

Accuracy | F-score | Confusion matrix values
0.65 | 0.39 | true negatives = 140, true positives = 13, false positives = 3, false negatives = 69

4.7 Results Obtained for Convolutional Neural Networks

In the convolutional neural network analysis, two models are used: CNN-rand, in which each word is initialized with a uniformly distributed random vector U[−1, 1], and CNN-word2vec, as described in section (3.5). Displayed below are the results and analysis for both CNN models with 3-fold cross-validation on the whole data set.

4.7.1 Analysis for Vendor-supplier Data Using the CNN-rand and CNN-word2vec Models

The shape of the input matrix for Vendor-supplier was 2515 × 300, i.e. maximum sentence length × dimension of the corresponding word vectors. The filter shapes used to extract features were 3 × 300, 4 × 300 and 5 × 300. The dimension of the hidden units was 100 × 2. The activation function used was ReLU. The drop-out
rate was 0.5 and the learning rate was 0.95. The following are the results for CNN-rand and CNN-word2vec with 3-fold cross-validation. Table (4.41) and figure (4.64) show the variation in accuracies for the CNN-rand and CNN-word2vec models.

Table 4.41: Variation in accuracies for the CNN-rand and CNN-word2vec models for Vendor-supplier on the whole data set

Fold | CNN-rand accuracy | CNN-word2vec accuracy
1 | 0.90049 | 0.91044
2 | 0.9168 | 0.9193
3 | 0.9070 | 0.92035

The average accuracy for CNN-rand is 0.9081 and for CNN-word2vec is 0.9167.

Figure 4.64: CNN-rand and CNN-word2vec models for Vendor-supplier on the whole data set with 3 folds
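As a rough illustration of the CNN described above, the sketch below builds a Kim-style sentence CNN with filter heights 3, 4 and 5 over 300-dimensional word vectors, 100 feature maps per filter size, ReLU activations and a dropout rate of 0.5. Keras is used here purely for illustration (the thesis does not state the framework), the optimizer choice is an assumption, and the switch between random U[−1, 1] embeddings (CNN-rand) and pretrained word2vec embeddings (CNN-word2vec) is shown only schematically.

```python
# Illustrative sketch (assumed Keras implementation, not the thesis code):
# sentence CNN with filter heights 3/4/5 over 300-d word vectors,
# 100 feature maps per filter size, ReLU, dropout 0.5, softmax over 2 classes.
from tensorflow.keras import layers, models
from tensorflow.keras.initializers import Constant, RandomUniform

def build_cnn(vocab_size, max_len, w2v_matrix=None):
    # CNN-rand: U[-1, 1] random word vectors; CNN-word2vec: word2vec init.
    emb_init = (Constant(w2v_matrix) if w2v_matrix is not None
                else RandomUniform(-1.0, 1.0))
    inputs = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, 300, embeddings_initializer=emb_init)(inputs)
    pooled = []
    for k in (3, 4, 5):                                # filter shapes k x 300
        conv = layers.Conv1D(100, k, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    merged = layers.Dropout(0.5)(layers.Concatenate()(pooled))
    outputs = layers.Dense(2, activation="softmax")(merged)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adadelta",                # assumed optimizer
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```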
4.7.2 Analysis for Acquisition Data Using the CNN-rand and CNN-word2vec Models

The shape of the input matrix for Acquisition was 580 × 300, i.e. maximum sentence length × dimension of the corresponding word vectors. The filter shapes used to extract features were 3 × 300, 4 × 300 and 5 × 300. The dimension of the hidden units was 100 × 2. The activation function used was ReLU. The drop-out rate was 0.5 and the learning rate was 0.95. The following are the results for CNN-rand and CNN-word2vec with 3-fold cross-validation. Table (4.42) and figure (4.65) show the variation in accuracies for the CNN-rand and CNN-word2vec models.

Table 4.42: Variation in accuracies for the CNN-rand and CNN-word2vec models for Acquisition on the whole data set

Fold | CNN-rand accuracy | CNN-word2vec accuracy
1 | 0.9439 | 0.9672
2 | 0.9251 | 0.9705
3 | 0.9386 | 0.9613

The average accuracy for CNN-rand is 0.9359 and for CNN-word2vec is 0.966.

4.7.3 Analysis for Job Data Using the CNN-rand and CNN-word2vec Models

The shape of the input matrix for Job was 1192 × 300, i.e. maximum sentence length × dimension of the corresponding word vectors. The filter shapes used to extract features were 3 × 300, 4 × 300 and 5 × 300. The dimension of the hidden units was 100 × 2. The activation function used was ReLU. The drop-out rate was 0.5 and the learning rate was 0.95. The following are the results for CNN-rand and CNN-word2vec with 3-fold cross-validation. Table (4.43) and figure (4.66) show the variation in accuracies for the CNN-rand and CNN-word2vec models.
Figure 4.65: CNN-rand and CNN-word2vec models for Acquisition on the whole data set with 3 folds

Table 4.43: Variation in accuracies for the CNN-rand and CNN-word2vec models for Job on the whole data set

Fold | CNN-rand accuracy | CNN-word2vec accuracy
1 | 0.7951 | 0.8226
2 | 0.8005 | 0.7941
3 | 0.8181 | 0.8357

The average accuracy for CNN-rand is 0.8046 and for CNN-word2vec is 0.8108.
Figure 4.66: CNN-rand and CNN-word2vec models for Job on the whole data set with 3 folds

4.8 Result Analysis

Given below is a summary of the results and analysis:

1. Active learning using the query-by-committee approach, in which the committee consists of three ensemble classifiers (i.e. gradient boosting, AdaBoost and random forest), performs better than the semi-supervised naive Bayes probabilistic model.

2. The performance of all three ensemble classifiers in terms of accuracy and F-score was consistent and good on the respective business event datasets.

3. The CNN-word2vec models performed better than the CNN-rand models on all three business event datasets.
Chapter 5

Conclusions and Future Work

Extraction of vital information from unstructured text is a hard problem. In the following sections, we discuss the challenges encountered in our work on business event recognition, a summary of our work, and its future scope.

5.1 Challenges Encountered in Business Event Recognition

The identification of business event sentences in online news articles is a tedious and difficult task. The following are the challenges encountered while performing our work on business event recognition.

1. Uncertainty in the amount of data to extract and in covering all possible variations of business event data: The data extraction should have included all possible variations of business news describing each event, so that classification of the business event could be performed under all variations. To a certain extent this was not possible, because we wrote crawlers to extract business news data from different websites. The information extracted from this crawled text was labeled and used for training the model, so the models failed to capture business event sentences whose variants were not present in the training data. This led to an increased number of false negatives compared to false positives, which is reflected in our models.
2. Application of active learning methods was time consuming: The active learning method involved querying of the most informative examples by the user; after querying, we retrained our classifiers with the newly added labeled data points, and the process was repeated. This was time consuming because a domain expert was required to supply the queried labels.

3. Business event datasets were unstructured: Because the business event datasets were unstructured, understanding the patterns and extracting useful features from them was difficult.

4. Bag-of-words vectorizers fail to capture the exact meaning of a word: The bag-of-words approach disregards both grammar and the exact meaning of a word within a sentence, so the analysis performed using this approach for business event recognition had drawbacks, as illustrated below.
Example: The model will recognize both of the sentences below as an acquisition event, because the meaning of "acquisition" is not distinct in a bag-of-words model.
a) Google acquired a land for developing its office.
b) Google acquired Yahoo as an organization.
So we had to develop models to overcome this problem in the classification of business events, which was challenging.

5. Restricted analysis of CNN models: The algorithm runtime for the CNN models with 3-fold cross-validation was around 20 hours (on a configuration with a Core i5 processor) for each of the business event datasets, so the analysis was restricted to 3-fold cross-validation for each of the respective business event datasets.

5.2 Conclusions

An automated model for recognizing business events in three domains, i.e. acquisition, vendor-supplier and job, was developed, which was the main objective of our
project. Our developed model was able to predict business event sentences and give out additional information, such as the organizations and persons involved in each business event sentence, which was our desired output.

In the bag-of-words approach, Tf-idf vectorizers performed better than count vectorizers when used with the ensemble classifiers to classify the business event data. In the conversion of words to vectors, the word-embedding and word2vec models capture the context of a sentence to a certain extent, in comparison with the bag-of-words model, which fails because it disregards the order and grammar of the words within a sentence.

Among the semi-supervised approaches, our results and analysis show that the active learning approach using ensemble learning gives better results than the naive Bayes classifier with expectation maximization in all three domains of business events, i.e. acquisition, vendor-supplier and job.

For the acquisition business event, the CNN models with cross-validation on the whole data set give better accuracies than the ensemble classifiers with cross-validation on the whole data set; for the vendor-supplier dataset the accuracies are nearly the same in both cases; and for the job event dataset the ensemble classifiers perform better than the CNN models.

5.3 Future Work

As the future scope of our project, the three major issues we can take up are the problem of coreference resolution, exhaustive analysis of the CNN models, and the application of HMMs to our model. Stated below is a description of these three future works.

1. The problem of coreference resolution exists in our model; this is the identification of noun phrases and other terms that refer to nouns, such as her, him, it, that, their, them, etc.
Example: ISRO acquired their organization.
Here, which organization "their" refers to is unknown; this is the problem of coreference resolution.

2. The CNN model was restricted to 3-fold cross-validation on the whole data set after parameter tuning with 3-fold cross-validation; further exhaustive analysis can be performed using 5-fold and 10-fold cross-validation to improve the overall performance of the model.

3. After the classification of business event sentences, if a false positive is obtained that contains the same keyword as a true positive, such false positives can be removed by building HMM models. An illustrative example is described below.
a) Google has acquired a plot in U.S.A.
b) Google is going to purchase a land.
These two sentences are false positives, so a probabilistic model using an HMM can be applied to such falsely classified positive sentences and they can be converted into true negatives.