BUSINESS EVENT RECOGNITION FROM
ONLINE NEWS ARTICLES
A Project Report
submitted by
MOHAN KASHYAP.P
in partial fulfillment of the requirements
for the award of the degree of
MASTER OF TECHNOLOGY
IN
MACHINE LEARNING AND COMPUTING
DEPARTMENT OF MATHEMATICS
INDIAN INSTITUTE OF SPACE SCIENCE AND TECHNOLOGY
Thiruvananthapuram - 695547
May 2015
CERTIFICATE
This is to certify that the thesis titled ’Business Event Recognition From
Online News Articles’, submitted by Mohan Kashyap.P, to the Indian Insti-
tute of Space Science and Technology, Thiruvananthapuram, for the award of the
degree of MASTER OF TECHNOLOGY, is a bonafide record of the research
work done by him under my supervision. The contents of this thesis, in full or in
parts, have not been submitted to any other Institute or University for the award
of any degree or diploma.
Dr. Sumitra.S
Supervisor
Department of Mathematics
IIST
Dr. Raju K. George
Head of Department
Department of Mathematics
IIST
Place: Thiruvananthapuram
May, 2015
DECLARATION
I declare that this thesis titled ’Business Event Recognition From Online
News Articles’ submitted in fulfillment of the Degree of MASTER OF TECH-
NOLOGY is a record of original work carried out by me under the supervision
of Dr. Sumitra .S, and has not formed the basis for the award of any degree,
diploma, associateship, fellowship or other titles in this or any other Institution
or University of higher learning. In keeping with the ethical practice in reporting
scientific information, due acknowledgements have been made wherever the find-
ings of others have been cited.
Mohan Kashyap.P
SC13M055
Place: Thiruvananthapuram
May, 2015
Abstract
Business Event Recognition From Online News Articles deals with the ex-
traction of news from text related to business events in three domains Acquisition,
Vendor-supplier and Job. The developed automated model for recognizing busi-
ness events would predict whether the online news article contains a business event
or not. For developing the model, the data related to business events had been
crawled. Since the manual labeling of data was expensive, semi-supervised learn-
ing techniques were used for getting required labeled data and then tagged data
had been pre-processed using techniques of natural language processing. Further
on vectorizers were applied on the text to convert it into numerics using bag-of-
words, word-embedding and word2vec approaches. In the end ensemble classifiers
with bag-of-words approach and CNN(Convolutional Neural Network) using word-
embedding, word2vec approaches were applied on the business event datasets and
the results obtained were found to be promising.
Acknowledgements
First and foremost, I thank God, The Almighty, for all His blessings. I would like to express my deepest gratitude to my research supervisor and teacher, Dr. Sumitra S., for her continuous guidance and motivation, without which this research work would never have been possible. I cannot thank her enough for her limitless patience and dedication in correcting my thesis report and molding it into its present form. Interactions with her taught me the importance of small things which are often overlooked and gave me an exposure to the art of approaching a problem from different angles. These lessons will be invaluable for me in my career and personal life ahead.
Besides my supervisor, I would like to thank my mentor, Mr. Mahesh C.R. of TataaTsu Idea Labs, for allowing me to carry out my thesis work in their organization. I would like to express my deepest gratitude to him for helping me realize my abilities and for building my confidence to solve challenging problems in Machine Learning, turning my theoretical understanding into practical real-time implementation. My sincere thanks also go to all the faculty members of the Department of Mathematics for their encouragement, questions and insightful comments.
I am grateful to my project lead at TataaTsu Idea Labs, Mr. Vinay, and his team for helping me in the implementation of the project work.
I would like to thank Research Scholar Shiju S. Nair for extending his 'any time' help and for providing additional inputs to my work.
I would also like to thank my classmates and friends at IIST for their company and for all the fun we had during the two years of M.Tech. Hailing from an Electrical background and not being that great in coding, special thanks go to Praveen and Sailesh for constantly supporting and guiding me through the two years in machine learning, and to Arvindh for inspiring me in certain aspects of the course work.
Last but not the least, I would like to thank my parents and my sister for their care, love and support throughout my life.
Contents
Acknowledgements iv
List of Figures vi
List of Tables ix
List of Abbreviations xii
1 Introduction 1
1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Natural Language Processing . . . . . . . . . . . . . . . . . 3
Information Extraction and Retrieval: . . . . . . . . . 4
Named Entity Recognition: . . . . . . . . . . . . . . . 4
Parts Of Speech Tagging: . . . . . . . . . . . . . . . . 4
1.2.2 Text to Numeric Conversion . . . . . . . . . . . . . . . . . . 4
1.2.3 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3.1 Semi-supervised Technique . . . . . . . . . . . . . . 5
1.2.3.2 Active Learning . . . . . . . . . . . . . . . . . . . . 6
Uncertainty sampling: . . . . . . . . . . . . . . . . . . 6
Query by the committee: . . . . . . . . . . . . . . . . 6
Expected model change: . . . . . . . . . . . . . . . . 7
Expected error reduction: . . . . . . . . . . . . . . . . 7
Variance reduction: . . . . . . . . . . . . . . . . . . . 7
1.2.4 Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.4.1 Ensemble Classifiers . . . . . . . . . . . . . . . . . 7
Bagging: . . . . . . . . . . . . . . . . . . . . . . . . . 8
Boosting: . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2.5 Convolutional Neural Network . . . . . . . . . . . . . . . . . 8
Convolutional Layer: . . . . . . . . . . . . . . . . . . 8
Activation Function: . . . . . . . . . . . . . . . . . . 9
Pooling layer: . . . . . . . . . . . . . . . . . . . . . . 9
Fully connected layer: . . . . . . . . . . . . . . . . . . 9
Loss layer: . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2.6 Measures used for Analysing the Results: . . . . . . . . . . . 9
1.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Part-of-speech (POS) pattern of the phrase: . . . . . 11
Extraction of rhetorical signal features: . . . . . . . . 11
1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Second Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Third Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4.3 Fourth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Fifth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.5 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Data Extraction, Data Pre-processing and Feature Engineering 14
2.1 Crawling of Data from Web . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Labeling of Extracted Data . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1.1 Acquisition Data Description . . . . . . . . . . . . 15
Acquisition event: . . . . . . . . . . . . . . . . . . . . 15
Non Acquisition event: . . . . . . . . . . . . . . . . . 15
2.2.1.2 Vendor-Supplier Data Description . . . . . . . . . . 15
Vendor-Supplier event: . . . . . . . . . . . . . . . . . 15
Non Vendor-Supplier event: . . . . . . . . . . . . . . 16
2.2.1.3 Job Data Description . . . . . . . . . . . . . . . . . 16
Job event: . . . . . . . . . . . . . . . . . . . . . . . . 16
Non Job event: . . . . . . . . . . . . . . . . . . . . . 16
2.2.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 Type 1 Features . . . . . . . . . . . . . . . . . . . . . . . . . 17
Noun, Noun-phrases and Proper nouns: . . . . . . . . 17
Example of Noun-phrase: . . . . . . . . . . . . . . . . 17
Word-Capital: . . . . . . . . . . . . . . . . . . . . . . 17
Example of Capital words: . . . . . . . . . . . . . . . 17
Parts of speech tag pattern: . . . . . . . . . . . . . . 17
Example of POS tag pattern Adj-Noun format: . . . 18
2.3.2 Type 2 Features . . . . . . . . . . . . . . . . . . . . . . . . . 18
Organization Name: . . . . . . . . . . . . . . . . . . . 18
Example of Organization names: . . . . . . . . . . . . 18
Organization references: . . . . . . . . . . . . . . . . 18
Examples of Organization references: . . . . . . . . . 18
Location: . . . . . . . . . . . . . . . . . . . . . . . . . 18
Example of location as feature . . . . . . . . . . . . . 18
Persons: . . . . . . . . . . . . . . . . . . . . . . . . . 18
Example of Persons: . . . . . . . . . . . . . . . . . . . 18
2.3.3 Type 3 Features . . . . . . . . . . . . . . . . . . . . . . . . . 19
Continuation: . . . . . . . . . . . . . . . . . . . . . . 19
Change of direction: . . . . . . . . . . . . . . . . . . . 19
Sequence: . . . . . . . . . . . . . . . . . . . . . . . . 19
Illustration: . . . . . . . . . . . . . . . . . . . . . . . 19
Emphasis: . . . . . . . . . . . . . . . . . . . . . . . . 19
Cause, condition or result : . . . . . . . . . . . . . . . 19
Spatial signals: . . . . . . . . . . . . . . . . . . . . . 19
Comparison or contrast: . . . . . . . . . . . . . . . . 19
Conclusion: . . . . . . . . . . . . . . . . . . . . . . . 19
Fuzz: . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Description of Vectorizers . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1 Count Vectorizers . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4.1.1 Example of Count Vectorizer . . . . . . . . . . . . 20
2.4.2 Term Frequency and Inverse Document Frequency . . . . . . 21
2.4.2.1 Formulation of Term Frequency and Inverse Doc-
ument Frequency . . . . . . . . . . . . . . . . . . . 21
Term-Frequency formulation: . . . . . . . . . . . . . . 21
Inverse Document Frequency formulation: . . . . . . . 21
2.4.2.2 Description of Combination of TF and IDF . . . . 22
2.4.2.3 Example of TF-IDF Vectorizer . . . . . . . . . . . 22
3 Machine Learning Algorithms Used For Analysis Of Business
Event Recognition 24
3.1 Semi-supervised Learning using Naive Bayes Classifier with Expectation-
Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Active Learning using Ensemble classifiers with QBC approach . . . 25
3.2.1 Query by committee . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Ensemble Models for Classification of Business Events using Bag-
Of-Words Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3.1 Gradient Boosting Classifier . . . . . . . . . . . . . . . . . . 26
3.3.2 AdaBoost Classifier . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.3 Random Forest Classifiers . . . . . . . . . . . . . . . . . . . 29
3.4 Multilayer Feed Forward with Back Propagation using word em-
bedding approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.5 Convolutional Neural Networks for Sentence Classification with un-
supervised feature vector learning . . . . . . . . . . . . . . . . . . . 30
3.5.1 Variations in CNN sentence models . . . . . . . . . . . . . . 32
CNN-rand: . . . . . . . . . . . . . . . . . . . . . . . . 32
CNN-static: . . . . . . . . . . . . . . . . . . . . . . . 32
4 Results and Discussions 34
4.1 Semi-supervised Learning Implementation using Naive Bayes with
Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 34
4.1.1 Results and Analysis of Vendor-Supplier Event Data . . . . 35
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 Results and Analysis for Job Event Data . . . . . . . . . . . 40
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.1.3 Result and Analysis for Acquisition Event Data . . . . . . . 45
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Active Learning implementation by Query by committee approach . 50
4.2.1 Results and Analysis for Vendor-Supplier Event Data . . . . 50
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.2 Result and Analysis for Job Event Data . . . . . . . . . . . 55
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.3 Result and Analysis for Acquisition Event Data . . . . . . . 60
Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.3 Comparison of Semi-supervised techniques and Active learning ap-
proach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.4 Results of Ensemble Classifiers with different Parameter tuning . . 65
4.4.1 Analysis for vendor-supplier event Data using 100 estimators
within the ensemble as the parameter . . . . . . . . . . . . . 65
4.4.2 Analysis for Job event Data using 100 estimators within the
ensemble as the parameter . . . . . . . . . . . . . . . . . . . 68
4.4.3 Analysis for Acquisition event Data using 100 estimators
within the ensemble as the parameter . . . . . . . . . . . . . 71
4.4.4 Analysis for Vendor-Supplier event Data using 500 estima-
tors within the ensemble as the parameter . . . . . . . . . . 74
4.4.5 Analysis for Job event Data using 500 estimators within the
ensemble as the parameter . . . . . . . . . . . . . . . . . . . 77
4.4.6 Analysis for Acquisition event Data using 500 estimators
within the ensemble as the parameter . . . . . . . . . . . . . 80
4.5 Final Accuracies and F-score estimates for the model . . . . . . . . 83
4.5.1 Final Analysis of Vendor-Supplier Dataset . . . . . . . . . . 84
4.5.2 Final Analysis of Job Dataset . . . . . . . . . . . . . . . . . 85
4.5.3 Final Analysis of Acquisition Dataset . . . . . . . . . . . . . 87
4.6 Results obtained for MFN with Word Embedding . . . . . . . . . . 90
4.7 Results obtained for Convolutional Neural Networks . . . . . . . . . 90
4.7.1 Analysis for Vendor-Supplier Data using CNN-rand and CNN-
word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.2 Analysis for Acquisition Data using CNN-rand and CNN-
word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.7.3 Analysis for Job using CNN-rand and CNN-word2vec Model 92
4.8 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5 Conclusions and Future work 95
5.1 Challenges Encountered in Business Event Recognition . . . . . . . 95
5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography 99
List of Figures
3.1 The Image describes the architecture for Convolutional Neural Net-
work with Sentence Modelling for multichannel architecture . . . . 31
4.1 Variations in Accuracies and F1-scores for Vendor-supplier data us-
ing Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . 36
4.2 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for VNSP . . . . . . . . . . . . . . . . . 37
4.3 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 37
4.4 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for VNSP . . . . . . . . . . . . . . . . . 38
4.5 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 38
4.6 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for VNSP . . . . . . . . . . . . . . . . . 39
4.7 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 39
4.8 Variations in Accuracies and F1-scores for Job event data using
Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . . . 41
4.9 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for JOB . . . . . . . . . . . . . . . . . . 42
4.10 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 42
4.11 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for JOB . . . . . . . . . . . . . . . . . . 43
4.12 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 43
4.13 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for JOB . . . . . . . . . . . . . . . . . . 44
4.14 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 44
4.15 Variations in Accuracies and F1-scores for Acquisition event data
using Naive-Bayes, semi-supervised technique . . . . . . . . . . . . 46
4.16 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Acquisition . . . . . . . . . . . . . . 47
4.17 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 47
4.18 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Acquisition . . . . . . . . . . . . . . 48
4.19 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 48
4.20 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Acquisition . . . . . . . . . . . . . . 49
4.21 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Acquisition . . . . . . . . . . . . . . . . . . . 49
4.22 Variations in Accuracies and F1-scores for Vendor-supplier data us-
ing Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.23 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Vendor-supplier . . . . . . . . . . . . 52
4.24 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Vendor-supplier . . . . . . . . . . . . . . . . 52
4.25 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Vendor-Supplier . . . . . . . . . . . 53
4.26 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Vendor-supplier . . . . . . . . . . . . . . . . 53
4.27 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Vendor-supplier . . . . . . . . . . . . 54
4.28 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Vendor-supplier . . . . . . . . . . . . . . . . 54
4.29 Variations in Accuracies and F1-scores for Job event data using
Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.30 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Job . . . . . . . . . . . . . . . . . . 57
4.31 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 57
4.32 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Job . . . . . . . . . . . . . . . . . . 58
4.33 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 58
4.34 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 59
4.35 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 59
4.36 Variations in Accuracies and F1-scores for Acquisition event data
using Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.37 Confusion matrix for large pool of testing data of 70 percent and
training data of 30 percent for Acquisition . . . . . . . . . . . . . . 62
4.38 Roc curve for large pool of testing data of 70 percent and training
data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 62
4.39 Confusion matrix for large pool of testing data of 60 percent and
training data of 40 percent for Acquisition . . . . . . . . . . . . . . 63
4.40 Roc curve for large pool of testing data of 60 percent and training
data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 63
4.41 Confusion matrix for large pool of testing data of 50 percent and
training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 64
4.42 Roc curve for large pool of testing data of 50 percent and training
data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 64
4.43 Variations in Accuracies and F1-scores for Vendor-supplier data for
5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 67
4.44 Confusion matrix for Vendor-supplier with number of estimators as
100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.45 Roc curve for Vendor-supplier with number of estimators as 100 68
4.46 Variations in Accuracies and F1-scores for Job data for 5-fold using
3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.47 Confusion matrix for Job with number of estimators as 100 . . . . . 71
4.48 Roc curve for Job with number of estimators as 100 . . . . . . . 71
4.49 Variations in Accuracies and F1-scores for Acquisition data for 5-
fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 73
4.50 Confusion matrix for Acquisition with number of estimators as 100 74
4.51 Roc curve for Acquisition with number of estimators as 100 . . . 74
4.52 Variations in Accuracies and F1-scores for Vendor-supplier data for
5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 76
4.53 Confusion matrix for Vendor-supplier with number of estimators as
500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.54 Roc curve for Vendor-supplier with number of estimators as 500 77
4.55 Variations in Accuracies and F1-scores for Job data for 5-fold using
3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.56 Confusion matrix for Job with number of estimators as 500 . . . . . 80
4.57 Roc curve for Job with number of estimators as 500 . . . . . . . 80
4.58 Variations in Accuracies and F1-scores for Acquisition data for 5-
fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 82
4.59 Confusion matrix for Acquisition with number of estimators as 500 83
4.60 Roc curve for Acquisition with number of estimators as 500 . . . 83
4.61 Variations in Accuracies and F1-scores for Vendor-supplier data for
whole data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.62 Variations in Accuracies and F1-scores for Job data for whole data
set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.63 Variations in Accuracies and F1-scores for Acquisition data 5-folds
accuracy variations for whole data set . . . . . . . . . . . . . . . . . 89
4.64 CNN-rand and CNN-word2vec models for Vendor-supplier on whole
data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.65 CNN-rand and CNN-word2vec models for Acquisition on whole
data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.66 CNN-rand and CNN-word2vec models for Job on whole data set
with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
List of Tables
1.1 Recognition of Named-Event Passages in News Articles and its ap-
plication to our work . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1 The words and their counts in the sentence1 . . . . . . . . . . . . . 22
2.2 The words and their counts in the sentence2 . . . . . . . . . . . . . 22
4.1 Variation in accuracies and F-scores in Semi-supervised learning
using naive Bayes for vendor-supplier data . . . . . . . . . . . . . . 36
4.2 Variation in accuracies and F-scores in Semi-supervised learning
using naive Bayes for Job event data . . . . . . . . . . . . . . . . . 41
4.3 Variation in accuracies and F-scores in Semi-supervised learning
using naive Bayes for Acquisition event data . . . . . . . . . . . . . 46
4.4 Variation in accuracies and F-scores using Active Learning for Vendor-
supplier event data . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.5 Variation in accuracies and F-scores using Active Learning for Job
event data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.6 Variation in accuracies and F-scores using Active Learning for Ac-
quisition event data . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.7 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 100 in vendor-supplier data set 66
4.8 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 100 in vendor-supplier data . . . . 66
4.9 Variation in accuracies and F-scores for random forest classifier for
number of parameter estimate as 100 in vendor-supplier data . . . . 66
4.10 Variation in test score for accuracy and F-score with Vendor-supplier
data using voting of three ensemble classifiers with number of esti-
mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.11 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 100 in Job data set . . . . . . 69
4.12 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 100 in Job data set . . . . . . . . 69
4.13 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 100 in Job data set . . . . . . . . 69
4.14 Variation in test score for accuracy and F-score with Job data using
voting of three ensemble classifiers with number of estimators as 100 70
4.15 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 100 in Acquisition data set . . 72
4.16 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 100 in Acquisition data set . . . . 72
4.17 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 100 in Acquisition data set . . . . 72
4.18 Variation in test score for accuracy and F-score with Acquisition
data using voting of three ensemble classifiers with number of esti-
mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.19 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 500 in vendor-supplier data set 75
4.20 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 500 in vendor-supplier data set . . 75
4.21 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 500 in vendor-supplier data set . . 75
4.22 Variation in test score for accuracy and F-score with vendor-supplier
data using voting of three ensemble classifiers with number of esti-
mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.23 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 500 in Job data set . . . . . . 78
4.24 Variation in accuracies and F-scores for Ada Boosting classifier for
number of parameter estimate as 500 in Job data set . . . . . . . . 78
4.25 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 500 in Job data set . . . . . . . . 78
4.26 Variation in test score for accuracy and F-score with Job data using
voting of three ensemble classifiers with number of estimators as 500 79
4.27 Variation in accuracies and F-scores for Gradient Boosting classifier
for number of parameter estimate as 500 in Acquisition data set . . 81
4.28 Variation in accuracies and F-scores for Random forest classifier for
number of parameter estimate as 500 in Acquisition data set . . . . 81
4.29 Variation in accuracies and F-scores for Ada boosting classifier for
number of parameter estimate as 500 in Acquisition data set . . . . 81
4.30 Variation in test score for accuracy and F-score with Acquisition
data using voting of three ensemble classifiers with number of esti-
mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.31 Variation in accuracies and F-scores for Gradient Boosting classifier
for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . 84
4.32 Variation in accuracies and F-scores for Ada Boosting classifier for
whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 84
4.33 Variation in accuracies and F-scores for Random forest classifier for
whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 85
4.34 Variation in accuracies and F-scores for Gradient Boosting classifier
for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.35 Variation in accuracies and F-scores for Ada Boosting classifier for
whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.36 Variation in accuracies and F-scores for Random forest classifier for
whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.37 Variation in accuracies and F-scores for Gradient Boosting classifier
for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . 88
4.38 Variation in accuracies and F-scores for Random forest classifier for
whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 88
4.39 Variation in accuracies and F-scores for Ada boosting classifier for
whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 89
4.40 Variation in test score for MFN with word embedding . . . . . . . . 90
4.41 Variation in accuracies and F-scores CNN-rand and CNN-word2vec
models for Vendor-supplier on whole data set . . . . . . . . . . . . . 91
4.42 Variation in accuracies and F-scores CNN-rand and CNN-word2vec
models for Acquisition on whole data set . . . . . . . . . . . . . . . 92
4.43 Variation in accuracies and F-scores CNN-rand and CNN-word2vec
models for Job on whole data set . . . . . . . . . . . . . . . . . . . 93
List of Abbreviations
POS Parts of Speech
NLTK Natural Language Toolkit
QBC Query By Committee
NLP Natural Language Processing
IE Information Extraction
IR Information Retrieval
NER Named Entity Recognition
ML Machine Learning
CNN Convolutional Neural Network
MFN Multilayer Feed Forward Network
TF Term Frequency
IDF Inverse Document Frequency
CBOW Continuous Bag Of Words
ROC Receiver Operating Characteristic
TPR True Positive Rate
FPR False Positive Rate
TP True Positives
FP False Positives
TN True Negatives
FN False Negatives
Chapter 1
Introduction
Textual information present on the web is unstructured, and extracting useful information from it for a specific purpose is tedious and challenging. So, over the years, various methods have been proposed for the extraction of useful text. Text mining is the domain that deals with the process of deriving high-quality information from unstructured text. The goal of text mining is essentially to convert unstructured text into structured data and thereby extract useful information by applying techniques of natural language processing (NLP) and pattern recognition.
The concept of manual text mining was first introduced in the mid-1980s (Hobbs et al., 1982). Over the past decade, technological advancements in this field have been significant, with the building of automated approaches for the extraction and analysis of text. Text mining is composed of five major components: information retrieval, data mining, machine learning, statistics and computational linguistics.
Text mining finds application in various domains, which include: (a) named entity recognition, which deals with the identification of named text features such as people, organizations and locations (Sang et al., 2003); (b) recognition of pattern-identified entities, which deals with the extraction of features such as telephone numbers, e-mail addresses and built-in database quantities that can be discerned using regular expressions or other pattern matches (Nadeau et al., 2007); (c) co-reference resolution, which deals with the identification of noun phrases and other terms that refer to the same nouns, e.g. her, him, it and their (Soon et al., 2001); (d) sentiment analysis, which includes extracting various forms of user intent information such as sentiment, opinion, mood and emotion; text analytics techniques are helpful in the analysis of sentiment at the level of different topics (Pang et al., 2008); (e) spam detection, which deals with the classification of e-mail as spam or not, based on the application of statistical machine learning and text mining techniques (Rowe et al., 2007); (f) news analytics, which deals with the extraction of vital news or information content of interest to the end user; and (g) business event recognition from online news articles.
Business Event Recognition From Online News Articles captures semantic signals and identifies patterns in unstructured text to extract business events in three main domains, i.e. acquisition, vendor-supplier and job events, from online news articles. An acquisition business event, in general, describes one organization acquiring another organization; the keywords used in the acquisition business event scenario are acquire, buy, sell, sold, bought, take-over, purchase and merger. A vendor-supplier business event, in general, describes one organization obtaining a contract from another organization to perform a certain task for that organization; the keywords used in the vendor-supplier business event scenario are contract, procure, sign, implement, select, award, work, agreement, deploy, provide, team, collaborate, deliver and joint. A job business event, in general, describes the appointment of persons to prominent positions and the hiring and firing of people within an organization.
Our thesis deals with the development of an automated model for busi-
ness event recognition from online news articles. For developing the automated
model of Business Event Recognition From Online News Articles, data has been
crawled from different websites such as reutersnews, businesswireindia.com and
prnewswire.com. Since the manual labeling of the data was expensive, the gath-
ered data was subjected to semi-supervised learning techniques and active learn-
ing methods for getting more tagged event data in the domains of acquisition,
vendor-supplier and job. Then the obtained tagged data was pre-processed using
natural language processing techniques. Further on, for the conversion of text
to numerics the bag-of-words, word-embedding and word2vec approaches were
3
used. Final analysis on the business event dataset was performed using ensem-
ble classifiers with bag-of-words approach and convolutional neural network with
word-embedding, word2vec approach.
1.1 Model Architecture
Given a set of online articles or documents of interest to the end user, our developed automated model must predict, as the class output, whether a given sentence contains a business event related to acquisition, vendor-supplier or job events.
If the automated model predicts a sentence as a business event, then it has to give out additional information describing the event, such as the entities involved in that particular event, like organizations and people. Providing such additional information helps the end user to make better decisions with quicker insights.
Business events happen around the world on a daily basis. An organization, as a competitor, would like to understand the business analytics of other organizations. The development of an automated approach for identifying such business events helps in better decision making, increases efficiency and helps that organization develop better business strategies.
1.2 Methods
The sections below describe the methods used in our work.
1.2.1 Natural Language Processing
The concept of information extraction and information retrieval in our work deals with the extraction and retrieval of business news containing business event sentences from online news articles. The concepts of part-of-speech (POS) tagging and named entity recognition (NER) are used as part of feature engineering in our work. The pattern of POS tags is essential in extracting useful semantic features, and NER is useful in extracting entity-type features like organizations, persons and locations, which form an integral part of any business event. The framework for our project is formed by the concepts of information extraction (IE) and information retrieval (IR). Discussed below are information extraction and retrieval, named entity recognition (NER) and part-of-speech (POS) tagging, which form the baseline for the implementation of the natural language processing techniques (Liddy, 2001).
Information Extraction and Retrieval: Information extraction and retrieval deal with searching for the required text, extracting semantic information from the text and storing the retrieved information in a particular form in the database.
Named Entity Recognition: Named entity recognition deals with extracting from a text document a set of entities such as people, places and type-based entities, which include organizations.
Parts Of Speech Tagging: The pattern of POS tags forms an important set of features for any NLP-related task. Extraction of proper semantic features is possible with the pattern of POS tags.
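As an illustration of these concepts, the following is a minimal sketch (not the exact pipeline used in this work) of POS tagging and named entity recognition with NLTK on a sample business sentence; the sample sentence and the printed entity types are illustrative only.

# Minimal sketch of POS tagging and NER with NLTK (illustrative only).
# Requires the NLTK data packages 'punkt', 'averaged_perceptron_tagger',
# 'maxent_ne_chunker' and 'words' to be downloaded beforehand.
import nltk

sentence = "IBM today announced a definitive agreement to acquire Silverpop."

tokens = nltk.word_tokenize(sentence)   # split the sentence into word tokens
pos_tags = nltk.pos_tag(tokens)         # POS tag pattern, e.g. ('IBM', 'NNP')
tree = nltk.ne_chunk(pos_tags)          # NER: chunk named entities

print(pos_tags)
for subtree in tree.subtrees():
    if subtree.label() in ("ORGANIZATION", "PERSON", "GPE"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))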
1.2.2 Text to Numeric Conversion
The conversion of words to vectors was implemented using the bag-of-words and word-embedding approaches. Described below is an overview of these concepts.
In the bag-of-words approach, a piece of text or a sentence of a document is represented as the bag (multiset) of its words, disregarding the grammar and the word order but keeping the multiplicity of the words intact (Harris, 1954). Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words from the sentences are mapped to vectors of real numbers in a low-dimensional space, relative to the vocabulary size (Tomas Mikolov et al., 2013).
One of the major disadvantages of the bag-of-words approach is that it fails to capture the semantics of a particular word within a sentence, because it converts words to vectors disregarding the grammar and the word order. Consider the following sentence, where the bag-of-words approach fails.
After drawing money from the Bank Ravi went to the river Bank.
In the bag-of-words approach there is no distinction between the financial Bank and the river Bank. This problem of capturing the semantics of a word is overcome to a certain extent by word embedding. In word embedding, each word is represented by a 100- to 300-dimensional uniformly distributed (i.e. U[-1,1]) random dense vector. Word embedding with a window approach captures semantics to a certain extent.
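The following is a minimal sketch contrasting the two representations, assuming scikit-learn and NumPy are available; the toy sentence, the vocabulary and the 100-dimensional U[-1,1] random embedding are illustrative and not the vectors actually used in this work.

# Sketch: bag-of-words counts vs. a random dense word embedding (illustrative only).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["After drawing money from the Bank Ravi went to the river Bank"]

# Bag-of-words: one count per vocabulary word, word order is lost.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences)
vocab = vectorizer.get_feature_names_out()   # get_feature_names() in older scikit-learn
print(dict(zip(vocab, bow.toarray()[0])))

# Word embedding: each vocabulary word mapped to a 100-dimensional dense
# vector drawn uniformly from U[-1, 1] (before any training).
rng = np.random.default_rng(0)
embeddings = {word: rng.uniform(-1.0, 1.0, size=100) for word in vocab}
print(embeddings["bank"].shape)   # (100,)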
1.2.3 Data Labeling
The extracted data points that had been labeled in a supervised manner were few in number. The sections below describe the semi-supervised technique and the active learning methods.
1.2.3.1 Semi-supervised Technique
The naive Bayes classifier forms an integral part of the implementation of semi-supervised learning using the naive Bayes classifier with expectation maximization to increase the number of labeled data points (Nigam et al., 2006). Discussed below is an overview of the naive Bayes classifier.
Naive Bayes classifiers are probabilistic classifiers based on Bayes' theorem. The naive Bayes classifier assumes that each feature is conditionally independent of every other feature given the class. The model is described as follows.
Given an input feature vector x = (x_1, x_2, ..., x_n)^T, we need to calculate which class this feature vector belongs to, i.e. p(Y_k | x_1, x_2, ..., x_n) for each of the K classes, where Y_k is the output variable for the kth class. Using Bayes' theorem, this probability can be rewritten as

p(Y_k | x) = p(Y_k) p(x | Y_k) / p(x)

where
p(Y_k) is the prior probability of class k,
p(x | Y_k) is the likelihood of the data given class k, and
p(x) is the probability of observing that particular data point.

The naive Bayes framework uses the maximum a posteriori (MAP) rule to pick the most probable class, with the posterior proportional to prior × likelihood. The classifier assigns a label ŷ = Y_k based on the MAP rule, and the prediction is given by

ŷ = argmax_{k ∈ {1,...,K}} p(Y_k) ∏_{i=1}^{n} p(x_i | Y_k).

In text mining, the classifier used is the multinomial naive Bayes classifier with the bag-of-words approach.
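A minimal sketch of a multinomial naive Bayes text classifier with bag-of-words features, using scikit-learn; the toy sentences and labels are illustrative and not the thesis data.

# Sketch: multinomial naive Bayes with bag-of-words features (illustrative data).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_sentences = [
    "IBM announced a definitive agreement to acquire Silverpop",   # acquisition event
    "Carlyle invests across four segments in Africa and Asia",     # non-event
]
train_labels = [1, 0]

vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_sentences)

clf = MultinomialNB()                  # MAP rule: prior x likelihood, as above
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["Oracle agreed to buy a software company"])
print(clf.predict(X_test), clf.predict_proba(X_test))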
1.2.3.2 Active Learning
Active learning with the query-by-committee approach using ensemble classifiers was implemented as part of our work to increase the number of labeled data points (Abe and Mamitsuka, 1998). Discussed below is the concept of active learning.
Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points. There are situations in which unlabeled data is abundant but manual labeling is expensive. In such a scenario, learning algorithms can actively query the user for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than the number required in normal supervised learning. The following are the query strategies for selecting the most informative data points in active learning.
Uncertainty sampling: Querying the user for the labels of those points about which the current model is least certain, i.e. the points for which the predicted label entropy is maximum.
Query by the committee: A committee of classifiers is trained on the current labeled data points; the classifiers vote on the predicted labels of the unlabeled points, and the user is queried for the labels on which the classifiers disagree the most (a sketch of this strategy follows the list).
Expected model change: Labeling the data points that would result in the greatest change to the current model.
Expected error reduction: Labeling the points that would most reduce the current model's generalization error.
Variance reduction: Labeling the points that most reduce the output variance of the current model, for example the points nearest to the separating hyperplane in an SVM.
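The following is a minimal sketch of the query-by-committee strategy under stated assumptions: a small committee of scikit-learn classifiers is trained on a labeled pool, and the unlabeled point with the highest vote entropy (greatest disagreement) is selected for manual labeling. The data and the committee members are illustrative.

# Sketch: query-by-committee with vote entropy (illustrative data and committee).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_unlabeled = rng.normal(size=(100, 5))

committee = [LogisticRegression(), GaussianNB(), DecisionTreeClassifier()]
votes = np.array([clf.fit(X_labeled, y_labeled).predict(X_unlabeled)
                  for clf in committee])          # shape (committee, unlabeled)

# Vote entropy per unlabeled point: high entropy = high disagreement.
def vote_entropy(column):
    counts = np.bincount(column, minlength=2) / len(column)
    nonzero = counts[counts > 0]
    return -np.sum(nonzero * np.log(nonzero))

entropies = np.apply_along_axis(vote_entropy, 0, votes)
query_index = int(np.argmax(entropies))           # point to send to the human labeler
print("query unlabeled point", query_index)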
1.2.4 Learning Classifiers
The classifiers used in our work were ensemble classifiers and convolutional neural networks (CNN). The sections below give a basic overview of the concepts required to understand the ensemble methods and the CNN implemented in our work.
1.2.4.1 Ensemble Classifiers
The random forest classifier implemented in our work (Breiman, 2001) is derived from the bootstrap aggregation technique. The gradient boosting classifier (Friedman et al., 2001) and the AdaBoost classifier (Freund et al., 1995) implemented in our work are derived from the boosting technique. Discussed below are the concepts of ensembles with bagging and boosting.
Ensembles combine classifiers so that the performance of the combined classifier on the model exceeds the performance of each individual classifier on the model. There are two different kinds of ensemble methods in practice: one is bagging, also called bootstrap aggregation, and the other is boosting.
Bagging: In bagging, a single classifier is learnt from a subset of the training data at each iteration. From a training set of size M, it is possible to draw M random instances using a uniform distribution. These M samples can be learned by a classifier, and this process is repeated several times. Since the sampling is done with replacement, certain data points may be picked more than once and others not at all within each subset of the original training dataset. A classifier is learnt on such a subset of the training data in each cycle, and the final prediction is based on a vote over the classifiers learnt on the different generated datasets.
Boosting: In boosting, a single classifier (or different classifiers) is learnt from a subset of the data at each iteration. The boosting technique analyses the performance of the classifier learnt at each iteration and forces subsequent classifiers to concentrate on the training instances that were incorrectly classified. Instead of choosing the M training instances randomly using a uniform distribution, the training instances are chosen so as to favour those that have not been accurately learned by the classifier. The final prediction is performed by taking a weighted vote of the classifiers learnt at the various iterations.
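A minimal sketch of bagging- and boosting-based ensembles of the kind named above, using scikit-learn; the synthetic data and the parameter values (e.g. 100 estimators) are illustrative.

# Sketch: random forest (bagging), gradient boosting and AdaBoost (boosting),
# combined by a majority vote (illustrative data and parameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("gb", GradientBoostingClassifier(n_estimators=100, random_state=0)),
        ("ab", AdaBoostClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",                     # final prediction by majority vote
)
ensemble.fit(X_train, y_train)
print("test accuracy:", ensemble.score(X_test, y_test))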
1.2.5 Convolutional Neural Network
Convolutional neural networks for sentence modelling, trained with a softmax classifier, were implemented in our work (Yoon Kim, 2014). Discussed below is an overview of a generalized convolutional neural network and the softmax classifier.
A convolutional neural network is a type of feed-forward neural network whose architecture consists of the following main layers: convolutional layer, pooling layer, fully connected layer and loss layer. The stacking of these layers forms the full conv-net architecture.
Convolutional Layer: The conventional convolution operation with Sobel or Prewitt filters on an image is useful in detecting features of the image such as edges and corners; in a convolutional neural network, by comparison, the parameters of each convolution kernel (i.e. each filter) are trained by the back-propagation algorithm. There are many convolution kernels in each layer, and each kernel is replicated over the entire image with the same parameters. The function of the convolution operators is to extract different features of the input.
Activation Function: The activation functions used in convolutional neural networks are the hyperbolic tangent f(x) = tanh(x), the ReLU f(x) = max(0, x) and the sigmoid f(x) = 1 / (1 + exp(−x)).
Pooling layer: This layer captures the most important feature by performing the max operation on the obtained feature map vector. All such max features together form the penultimate layer.
Fully connected layer: Finally, after several convolutional and max pooling layers, the high-level reasoning in the neural network is done via fully connected layers. A fully connected layer takes all neurons in the previous layer (be it fully connected, pooling or convolutional) and connects each of them to every single neuron it has. Fully connected layers are not spatially located anymore (you can visualize them as one-dimensional), so there can be no convolutional layers after a fully connected layer.
Loss layer: After the fully connected layer, a soft-max classifier is present at the output layer with a soft-max loss function, to predict the probabilistic labels.
The soft-max classifier is obtained from the soft-max function: for a sample input vector x, the predicted probability of the output y being the jth class among K classes is given as

P(y = j | x) = exp(x^T w_j) / Σ_{k=1}^{K} exp(x^T w_k)
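A minimal NumPy sketch of the soft-max probability defined above; the input vector and the weight matrix are illustrative.

# Sketch: soft-max probabilities P(y = j | x) = exp(x.T w_j) / sum_k exp(x.T w_k).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input vector
W = rng.normal(size=(4, 3))     # one weight column w_k per class (illustrative)

scores = x @ W                          # x.T w_k for each class k
scores -= scores.max()                  # subtract max for numerical stability
probs = np.exp(scores) / np.exp(scores).sum()
print(probs, probs.sum())               # probabilities over the K = 3 classes, sum to 1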
1.2.6 Measures used for Analysing the Results:
The performance measures used for our results and analysis are described as follows (Powers et al., 2007); a short sketch of how they can be computed follows the list.
1. F-score: The F-score is a measure used in information retrieval for measuring sentence classification performance, since it takes only the true positives into account and not the true negatives. The F-score is defined as:
F1 = (2 × TP) / (2 × TP + FP + FN)
2. Confusion matrix: The performance of any classification algorithm can be visualized by a specific table layout called the confusion matrix. Each column of the confusion matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
3. ROC curve: It is a plot of TPR against FPR. The TPR describes the fraction of true positive results among the total positive samples, and the FPR describes the fraction of incorrect positive results among the total negative samples. The area under the ROC curve is a measure of accuracy.
4. Accuracy: The accuracy of a classification problem is defined as:
accuracy = (TP + TN) / (P + N)
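A minimal sketch of computing these measures with scikit-learn, assuming true labels, hard predictions and probabilistic scores are available; the arrays are illustrative.

# Sketch: accuracy, F-score, confusion matrix and ROC-AUC (illustrative labels).
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score, roc_curve)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]                    # hard predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # probabilistic scores

print("accuracy:", accuracy_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred))
print("confusion matrix:\n", confusion_matrix(y_true, y_pred))  # rows: actual, cols: predicted
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("area under ROC curve:", roc_auc_score(y_true, y_score))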
1.3 Related Works
The paper closest to our work on Business Event Recognition From Online News Articles is Recognition of Named-Event Passages in News Articles (Luis Marujo et al., 2012). That paper describes a method for finding named events in the violent behaviour and business domains, in the specific passages of news articles that contain information about such events, and reports preliminary evaluation results using NLP techniques and ML algorithms. Table 1.1 describes the paper Recognition of Named-Event Passages in News Articles and its application to our work.
As part of the feature engineering used in our work, we have used some of the feature engineering techniques of (Luis Marujo et al., 2012). The following are the features used in our work with reference to this paper.
Part-of-speech (POS) pattern of the phrase: (e.g. <noun>, <adj, noun>, <adj, adj, noun>, etc.) Nouns and noun phrases are the most common patterns observed in key phrases containing named events, verbs and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare.
Extraction of rhetorical signal features: These are a set of features which capture the reader's attention in news events: continuation, change of direction, sequence, illustration, emphasis, cause/condition/result, spatial signals, comparison/contrast, conclusion and fuzz.
1.4 Thesis Outline
The second chapter deals with the extraction and understanding of business event data, the third chapter deals with the application of machine-learning algorithms to the obtained data, the fourth chapter deals with the results and analysis on the business event datasets, and finally the fifth chapter deals with the conclusions of our work.
1.4.1 Second Chapter
This chapter deals with the extraction of business event data from the web, followed by pre-processing of the data, application of feature engineering on the obtained data and, finally, conversion of the data into vectors for applying machine-learning algorithms.
1.4.2 Third Chapter
This chapter deals with applying semi-supervised techniques on the data to increase the number of labeled data points, and with the understanding of the algorithms of the different ensemble classifiers and the convolutional neural network (CNN).
Table 1.1: Recognition of Named-Event Passages in News Articles and its application to our work

Recognition of named-event passages from news articles:
1. Deals with automatically identifying multi-sentence passages in a news article that describe named events. Specifically, the paper focuses on ten event types: five in the violent behaviour domain (terrorism, suicide bombing, sex abuse, armed clashes and street protests) and five in the business domain (management changes, mergers and acquisitions, strikes, legal troubles and bankruptcy).
2. The problem is solved as a multiclass classification problem for which the training data was obtained through crowd-sourcing, using Amazon Mechanical Turk to label the data points as events or non-events. Ensemble classifiers are then used for the classification of these sentences for each event, and finally passages containing the same events are aggregated using HMM methods.

Business event recognition from online news articles:
1. Our work, derived from Recognition of Named-Event Passages in News Articles, focuses exclusively on identifying business events in the domains of merger and acquisition, vendor-supplier and job events.
2. The problem in our case is solved as binary classification for each of the three domains (merger and acquisition, vendor-supplier and job), indicating whether a sentence describes that particular event or not. The procedure differs in that we label a few data points in a supervised way and then apply semi-supervised techniques to increase the number of labeled data points; finally, ensemble classifiers and convolutional neural networks are applied for classification of the labeled data points.
1.4.3 Fourth Chapter
This chapter deals with the results and analysis of the applied machine-learning techniques, which include semi-supervised learning analysis, ensemble classifier analysis and the analysis of convolutional neural networks.
1.4.4 Fifth Chapter
This chapter deals with the challenges encountered while performing the project, the conclusions of the project and the future scope of the project.
1.5 Thesis Contribution
Our work focuses on business event recognition in three domains: acquisition, vendor-supplier and job. This whole process of identifying business event news exclusively in these three domains, using the knowledge of machine learning and NLP techniques, is the main contribution of our work.
Chapter 2
Data Extraction, Data Pre-processing and Feature Engineering
The initial step in business event recognition is business news extraction and the labeling of a few of the extracted data points, so that the task can be formulated as a machine learning problem. The method of data extraction from the web and the labeling of some of the extracted data are described in the following sections.
2.1 Crawling of Data from Web
There are several methods to crawl data from the web; one such method is described in this section. Every website has its own HTML logic, so separate crawling logic had to be written to extract text data from different websites. The modules used for data extraction in Python are BeautifulSoup and urllib. For our study, information is extracted from particular websites such as businesswireindia.com, prnewswire.com and Reuters news.
The Python language framework was used in our work. The urllib module in Python is used to fetch the particular set of pages that have to be accessed on the web. The BeautifulSoup module in Python uses the HTML logic and finds the contents present within each page in the form of the title, subtitle and description corresponding to each content block, block by block. Finally, the extracted title, subtitle and body contents are stored in text-file format.
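A minimal sketch, under stated assumptions, of fetching one page with urllib and extracting the title and paragraph text with BeautifulSoup; the URL and the tags used are placeholders, since every website needs its own HTML logic.

# Sketch: fetching one news page and pulling out title and body text
# (the URL and the tag choices are placeholders; real sites differ).
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.example.com/news/sample-press-release"   # placeholder URL
html = urllib.request.urlopen(url).read()

soup = BeautifulSoup(html, "html.parser")
title = soup.title.get_text(strip=True) if soup.title else ""
body = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))

with open("article.txt", "w", encoding="utf-8") as f:        # store as a text file
    f.write(title + "\n" + body)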
2.2 Labeling of Extracted Data
Since the business events are in the form of sentences, the raw text document obtained from web crawling is split into sentences using the Natural Language Toolkit (NLTK) sentence tokenizer; a small sketch of this splitting is given below. Some of the sentences were labeled in the three classes of merger and acquisition, vendor-supplier and job, describing whether each is a business event or not.
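A minimal sketch of the sentence splitting with the NLTK sentence tokenizer; the raw text is illustrative.

# Sketch: splitting a crawled raw-text document into sentences with NLTK.
# Requires the NLTK 'punkt' tokenizer data to be downloaded beforehand.
from nltk.tokenize import sent_tokenize

raw_text = ("IBM today announced a definitive agreement to acquire Silverpop. "
            "The company is based in Atlanta, GA.")
sentences = sent_tokenize(raw_text)
print(sentences)   # each sentence becomes one candidate for event labeling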
2.2.1 Data Description
Stated below are illustrations of data describing a business event and a non-event in each of the three classes: acquisition, vendor-supplier and job.
2.2.1.1 Acquisition Data Description
Acquisition event: ARMONK, N.Y., April 10, 2014 /PRNewswire/ – IBM
(NYSE: IBM) today announced a definitive agreement to acquire Silverpop, a
privately held software company based in Atlanta, GA.
Non Acquisition event: Carlyle invests across four segments Corporate Private Equity Real Assets Global Market Strategies and Solutions in Africa Asia Australia Europe the Middle East North America and South America.
2.2.1.2 Vendor-Supplier Data Description
Vendor-Supplier event: Tri-State signs agreement with NextEra Energy Resources for new wind facility in eastern Colorado under the Director Jack stone; WESTMINSTER, Colo., Feb. 5, 2014 /PRNewswire/ – Tri-State Generation and Transmission Association, Inc. announced that it has entered into a 25-year agreement with a subsidiary of NextEra Energy Resources, LLC for a 150 megawatt wind power generating facility to be constructed in eastern Colorado, in the service territory of Tri-State member cooperative K. C. Electric Association (Hugo, Colo.).
Non Vendor-Supplier event: The implementation of the DebMed GMS elec-
tronic hand hygiene monitoring system is a clear demonstration of Meadows Re-
gional Medical Center’s commitment to patient safety, and we are excited to
partner with such a forward-thinking organization that is focused on providing
a state-of-the-art patient environment, said Heather McLarney, vice president of
marketing, DebMed.
2.2.1.3 Job Data Description
Job event: In a note to investors, analysts at FBR Capital Markets said the
appointment of Nadella as Director of the company was a ”safe pick” compared
to choosing an outsider.
Non Job event: This partnership is an example of steps we are taking to sim-
plify and improve the Tactile Medical order process, said Cathy Gendreau, Business Director.
2.2.2 Data Pre-processing
The raw business event sentences obtained during data extraction were cleansed by removing special characters and stop-words such as "the", "and" and "an". Stop-words are common to both the positive and negative classes, so removing them enhances the difference between the two classes. The NLTK module in Python was used for this pre-processing.
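The cleansing step could look roughly as follows; this is a simplified sketch using the NLTK English stop-word list and a regular expression for the special characters, not the exact code used in the project.

    import re
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords")
    STOP_WORDS = set(stopwords.words("english"))

    def cleanse(sentence):
        # Remove special characters, keeping only letters, digits and spaces.
        sentence = re.sub(r"[^A-Za-z0-9\s]", " ", sentence)
        # Drop stop-words such as "the", "and" and "an".
        tokens = [w for w in sentence.split() if w.lower() not in STOP_WORDS]
        return " ".join(tokens)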
2.3 Feature Engineering
To build hand-crafted features, we observed the extracted unstructured data and looked for patterns from which useful features could be derived. The features extracted are described below, and the examples for the corresponding features are taken from the vendor-supplier event in (2.2.1.2).
2.3.1 Type 1 Features
Shallow semantic features record the pattern and semantics of the data and consist of the following features (Luismaurijo et al., 2012).
Noun, noun-phrases and proper nouns: Entities form an integral part of business event sentences, so noun phrases and proper nouns are common in sentences containing business events. The NLTK part-of-speech tagger was used to extract noun phrases, nouns and proper nouns from each sentence.
Example of Noun-phrase: Title agreement Next Era Energy wind facility
eastern Colorado WESTMINSTER Colo. Feb. Generation Transmission Associa-
tion Inc. agreement subsidiary NextEra Energy LLC megawatt wind power facility
eastern Colorado service territory member K. C. Electric Association Hugo Colo.
Word-Capital: If a capital letter is present in a sentence containing a business event, there is a higher chance of organizations, locations and persons being present in the sentence; these in turn are entity-like features which enhance event recognition.
Example of capital words: WESTMINSTER, LLC, K.C. Here WESTMINSTER is a location and K.C. is an organization, an illustration of entity features obtained from capital words.
Parts of speech tag pattern: Patterns of part-of-speech tags such as adjective-noun (an adjective followed by a noun) and adjective-adjective-noun (two adjectives followed by a noun) are good features for event recognition. Adjectives are used to describe a noun, so such patterns are likely to occur in business event sentences. Noun and noun phrases are the most common pattern observed in the key phrases of business event sentences, verb and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare.
Example of the Adj-Noun POS tag pattern: new wind, 25-year agreement, Tri-State member; here "new", "25-year" and "Tri-State" function as adjectives modifying the nouns "wind", "agreement" and "member".
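A possible sketch of how these shallow features can be collected with the NLTK part-of-speech tagger is shown below; the function name and the exact tag patterns kept are illustrative assumptions.

    import nltk

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    def shallow_features(sentence):
        tokens = nltk.word_tokenize(sentence)
        tagged = nltk.pos_tag(tokens)
        # Nouns and proper nouns (tags NN, NNS, NNP, NNPS).
        nouns = [w for w, t in tagged if t.startswith("NN")]
        # Capitalised words, a rough signal for entities.
        capitals = [w for w in tokens if w[:1].isupper()]
        # Adjective-noun patterns (JJ followed by a noun tag).
        adj_noun = [(w1, w2) for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                    if t1 == "JJ" and t2.startswith("NN")]
        return nouns, capitals, adj_noun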
2.3.2 Type 2 Features
Entity type features: To capture the entities present in the business event sentence.
Following described are some of the features.
Organization Name: Organization names are usually present in sentences containing business events and often give additional insight as features for event recognition.
Example of Organization names: Tri-state Tri-State Generation and Trans-
mission Association, NextEra Energy Resources.
Organization references: References to organization entities present in the business event sentences are taken as features.
Examples of Organization references: K. C. Electric Association
Location: Location is an important entity feature, giving more insight into the description of business events.
Example of location as feature : WESTMINSTER Colo. Colorado
Persons: There is a higher chance of a person or a group of people being present in sentences that contain business events, so persons are used as features to enhance business event recognition.
Example of Persons: Jack stone
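These entity features could, for example, be obtained with NLTK's named entity chunker, as sketched below; the actual system may have used a different NER component, so this is only an illustrative assumption (GPE is NLTK's label for locations).

    import nltk

    # One-time downloads of the tokenizer, tagger, chunker and word list used below.
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg)

    def entity_features(sentence):
        tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        entities = {"ORGANIZATION": [], "GPE": [], "PERSON": []}
        for node in tree:
            # Named entity chunks are subtrees; plain tokens are (word, tag) pairs.
            if hasattr(node, "label") and node.label() in entities:
                entities[node.label()].append(" ".join(w for w, _ in node.leaves()))
        return entities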
2.3.3 Type 3 Features
Rhetorical features: These are semantic signals which capture the reader's attention in a business event sentence; the following eleven features are identified in the literature, as described in (Luismaurijo et al., 2012).
Continuation: There are more ideas to come e.g.: moreover, furthermore, in
addition, another.
Change of direction: There is a change of topic e.g.: in spite of, nevertheless,
the opposite, on the contrary.
Sequence: There is an order in presenting ideas, e.g.: in the first place, next, into.
Illustration: Gives an example e.g.: to illustrate, in the same way as, for in-
stance, for example.
Emphasis: Increases the relevance of an idea; these are the most important signals, e.g.: it all boils down to, the most substantial issue, should be noted, the crux of the matter, more than anything else.
Cause, condition or result: There is a condition or modification applying to the following idea, e.g.: if, because, resulting from.
Spatial signals: Denote locations e.g.: in front of, between, adjacent, west, east,
north, south, beyond.
Comparison or contrast: Comparison of two ideas e.g.: analogous to, better,
less than, less, like, either.
Conclusion: Ending the introduction of the idea and may have special impor-
tance e.g.: in summary, from this we see, last of all, hence, finally.
Fuzz: There is an idea that is not clear e.g.: looks like, seems like, alleged,
maybe, probably, sort of.
2.4 Description of Vectorizers
All the features extracted from a given sentence have to be converted into vectors using vectorizers such as the count vectorizer and the TF-IDF vectorizer. The method used to convert words to vectors is the bag-of-words approach; the two vectorizers based on it are described below.
2.4.1 Count Vectorizers
This module uses the counts of the words present within a sentence and converts the sentence into a vector by building a dictionary for the word-to-vector conversion (Harris, 1954). An illustrative example of the count vectorizer is described below.
2.4.1.1 Example of Count Vectorizer
Consider the following two sentences.
a) John likes to watch movies. Mary likes movies too.
b) John also likes to watch football games.
Based on the above two sentences, a dictionary is constructed as follows:
{ John:1 , likes:2 , to:3 , watch:4 , movies:5 , also:6 , football:7 , games:8 , Mary:9
, too:10 }
The dictionary constructed has 10 distinct words. Using the indexes of the dictio-
nary, each sentence is represented by a 10-entry vector:
sentence1 : [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
sentence2 : [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
where each entry of the vectors refers to count of the corresponding entry in the
dictionary (this is also the histogram representation). For example, in the first
vector (which represents sentence 1), the first two entries are [1,2]. The first entry
corresponds to the word John which is the first word in the dictionary, and its
value is 1 because John appears in the first sentence 1 time. Similarly the second
entry corresponds to the word likes which is the second word in the dictionary,
and its value is 2 because likes appears in the first sentence 2 times. This vector
representation does not preserve the order of the words in the original sentences.
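The same construction is available in scikit-learn's CountVectorizer, as sketched below; note that scikit-learn lower-cases the text and orders the vocabulary alphabetically, so the vectors differ from the hand-built dictionary above in ordering but not in content.

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["John likes to watch movies. Mary likes movies too.",
                 "John also likes to watch football games."]

    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(sentences)

    print(vectorizer.get_feature_names_out())   # vocabulary built from the two sentences
    print(counts.toarray())                     # per-sentence word-count vectors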
2.4.2 Term Frequency and Inverse Document Frequency
Term frequency and inverse document frequency describe the importance of a particular word in a document or sentence within a collection of documents (Manning et al., 2008).
Term frequency (TF) is defined as the number of occurrences of a particular word within a document.
Document frequency is the number of documents containing the particular word; the inverse document frequency (IDF) is derived from it as formulated below.
For the analysis in our work using tf-idf with the bag-of-words approach, we treat each document as a sentence.
Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic intended to reflect how important a particular word is to a sentence within a collection of sentences.
2.4.2.1 Formulation of Term Frequency and Inverse Document Fre-
quency
Term frequency formulation: The term frequency tf(t, d) describes the number of times that term t occurs in the sentence d. Two formulations of term frequency are described below:
a) Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise.
b) Logarithmically scaled frequency: tf(t, d) = 1 + log(f(t, d)), where f(t, d) is the raw count of t in d, if t occurs in d, and 0 otherwise.
Inverse Document Frequency formulation: Inverse document frequency is a measure of how much information a particular word provides in a sentence, in comparison with the collection of sentences under consideration; it measures whether the term is common or rare across the whole collection. Mathematically it is defined as

idf(t, D) = log( N / |{d ∈ D : t ∈ d}| )

where N is the total number of sentences in the collection and |{d ∈ D : t ∈ d}| is the number of sentences d in which the term t appears (i.e., tf(t, d) ≠ 0). If the term does not occur in any sentence, this leads to a division by zero, so it is common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|.
2.4.2.2 Description of Combination of TF and IDF
Tf-idf is then calculated as tf-idf(t, d, D) = tf(t, d) × idf(t, D). A high tf-idf weight is reached by a high term frequency in the given sentence and a low document frequency of the term in the whole collection of sentences; the weights hence tend to filter out common terms.
2.4.2.3 Example of TF-IDF Vectorizer
Consider term frequency tables (2.1) and (2.2) for a collection consisting of only
two sentences, as listed below.
Table 2.1: The words and their counts in sentence1

    Term     Term count
    this     1
    is       1
    a        2
    sample   1

Table 2.2: The words and their counts in sentence2

    Term      Term count
    this      1
    is        1
    another   2
    example   3
The calculation of tf-idf for the term "this" in sentence1 is performed as follows. Term frequency, in its basic form, is just the count looked up in the appropriate table; in this case it is one for the term "this" in sentence1. The IDF for the term "this" is

idf(this, D) = log( N / |{d ∈ D : t ∈ d}| )

The numerator N is the number of sentences, which is two. The number of sentences in which "this" appears is also two, giving

idf(this, D) = log(2/2) = 0

So the tf-idf value is zero for this term, and with the basic definition this is true of any term that occurs in all sentences.
Now consider the term "example" from sentence2, which occurs three times but in only one sentence (sentence2). Its tf-idf is

tf(example, sentence2) = 3
idf(example, D) = log(2/1) ≈ 0.3010
tf-idf(example, sentence2) = tf(example, sentence2) × idf(example, D) = 3 × 0.3010 ≈ 0.9030
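A corresponding sketch with scikit-learn's TfidfVectorizer is shown below; scikit-learn uses a smoothed IDF, log((1 + N)/(1 + df)) + 1, and its default tokenizer drops single-character tokens such as "a", so the numerical values differ from the hand computation above, although common terms are still down-weighted.

    from sklearn.feature_extraction.text import TfidfVectorizer

    sentences = ["this is a a sample",
                 "this is another another example example example"]

    vectorizer = TfidfVectorizer()
    weights = vectorizer.fit_transform(sentences)

    print(vectorizer.get_feature_names_out())   # vocabulary terms
    print(weights.toarray())                    # tf-idf weight of each term per sentence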
Chapter 3
Machine Learning Algorithms
Used For Analysis Of Business
Event Recognition
This chapter discusses the set of machine learning algorithms that were implemented as part of our work. The semi-supervised approach with naive Bayes expectation-maximization and active learning with QBC are used to increase the amount of labeled data. The gradient boosting classifier, AdaBoost classifier, random forest classifier, multilayered feed forward network and convolutional neural network are used to classify the business event data. The following sections give a detailed description of these algorithms.
3.1 Semi-supervised Learning using Naive Bayes
Classifier with Expectation-Maximization Al-
gorithm
In this approach a naive Bayes classifier is first built in the standard supervised fashion from the limited amount of labeled training data, and the unlabeled data are classified with this model, noting the probabilities associated with each class. A new naive Bayes classifier is then rebuilt using all the labeled data together with the unlabeled data, treating the estimated class probabilities as true class labels. This process of classifying the unlabeled data and rebuilding the naive Bayes model is iterated until it converges to a stable classifier, and the corresponding set of labels for the unlabeled data is obtained. The algorithm is summarized below as in (Kamal Nigam et al., 2006).
1. Inputs: collections Xl of labeled sentences and Xu of unlabeled sentences.
2. Build an initial naive Bayes classifier K* from the labeled sentences Xl only.
3. Loop while the classifier parameters improve, as measured by the change in l(K|X, Y) (the log probability of the labeled and unlabeled data and the prior):
   (a) (E-step) Use the current classifier K* to estimate the component membership of each unlabeled sentence, i.e. the probability that each mixture component (and class) generated each sentence, P(Y = cj | X = xi; K*), where X and Y are random variables, cj is the output of the j-th class and xi is the i-th input data point.
   (b) (M-step) Re-estimate the classifier K*, given the estimated component membership of each sentence. Use maximum a posteriori parameter estimation to find K* = arg max_K P(X, Y | K) P(K).
4. Output: the classifier K*, which takes an unlabeled sentence and predicts a class label.
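A compact sketch of this loop, using scikit-learn's MultinomialNB and the estimated posteriors as sample weights, is given below; it is an illustrative implementation under the assumption of bag-of-words feature matrices, with the log-likelihood convergence test replaced by a fixed number of iterations.

    import numpy as np
    from scipy.sparse import vstack
    from sklearn.naive_bayes import MultinomialNB

    def nb_em(X_l, y_l, X_u, n_iter=10):
        # Step 2: initial naive Bayes classifier from the labeled sentences only.
        clf = MultinomialNB().fit(X_l, y_l)
        classes = clf.classes_
        for _ in range(n_iter):
            # E-step: posterior class probabilities for the unlabeled sentences.
            post = clf.predict_proba(X_u)
            # M-step: refit on labeled plus unlabeled data, replicating each
            # unlabeled sentence once per class, weighted by its posterior.
            X_all = vstack([X_l] + [X_u] * len(classes))
            y_all = np.concatenate([y_l] + [np.full(X_u.shape[0], c) for c in classes])
            w_all = np.concatenate([np.ones(X_l.shape[0])] +
                                   [post[:, j] for j in range(len(classes))])
            clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
        return clf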
3.2 Active Learning using Ensemble classifiers
with QBC approach
The ensemble classifiers used for the QBC approach are the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. This approach is described briefly below.
3.2.1 Query by committee
In this approach an ensemble of hypotheses is learned and examples that cause
maximum disagreement amongst this committee (with respect to the predicted
categorization) are selected as the most informative examples from a pool of unla-
beled examples. QBC iteratively selects examples to be labeled for training, and in each iteration a committee of classifiers trained on the current training set predicts labels.
Then it evaluates the potential utility of each example in the unlabeled set, and
selects a subset of examples with the highest expected utility. The labels for these
examples are acquired and they are transferred to the training set. Typically,
the utility of an example is determined by some measure of disagreement in the
committee about its predicted label. This process is repeated until the number of
available requests for labels is exhausted.
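One possible sketch of a single QBC selection round with the three ensemble classifiers is shown below; the disagreement measure used here (the number of distinct labels proposed by the committee) is a simple illustrative choice, not necessarily the exact measure used in the project.

    import numpy as np
    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier)

    def qbc_select(X_labeled, y_labeled, X_pool, batch_size=50):
        # Train the committee on the current labeled set.
        committee = [GradientBoostingClassifier(), AdaBoostClassifier(),
                     RandomForestClassifier()]
        votes = []
        for clf in committee:
            clf.fit(X_labeled, y_labeled)
            votes.append(clf.predict(X_pool))
        votes = np.array(votes)            # shape: (committee size, pool size)
        # Disagreement of the committee on each pool example: the number of
        # distinct labels it proposes (1 = full agreement).
        disagreement = np.array([len(np.unique(votes[:, i]))
                                 for i in range(votes.shape[1])])
        # Indices of the pool examples to send for labeling next.
        return np.argsort(-disagreement)[:batch_size]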
3.3 Ensemble Models for Classification of Busi-
ness Events using Bag-Of-Words Approach
The series of classifiers trained on the dataset included the SVM, decision-tree, random forest, AdaBoost, gradient boosting and SGD classifiers. Among these, the boosting classifiers and the random forest performed better than the others. We therefore used three ensemble classifiers with a decision tree as the base learner, namely the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. In the end, classification of the business event datasets was done by majority voting of these classifiers, as sketched below. The description and mathematical formulation of each ensemble classifier are given in the following subsections.
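The majority vote over the three ensembles can be expressed, for example, with scikit-learn's VotingClassifier, as in the sketch below; the hyper-parameters are left at their defaults and are not the tuned values used in the experiments.

    from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                                  RandomForestClassifier, VotingClassifier)

    def classify_events(X_train, y_train, X_test):
        # Hard (majority) voting over the three tree-based ensembles; each of
        # them uses decision trees as its base learner by default.
        voter = VotingClassifier(estimators=[
            ("gb", GradientBoostingClassifier()),
            ("ada", AdaBoostClassifier()),
            ("rf", RandomForestClassifier())], voting="hard")
        voter.fit(X_train, y_train)       # X_train: bag-of-words feature matrix
        return voter.predict(X_test)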
3.3.1 Gradient Boosting Classifier
Boosting algorithms are a set of machine learning algorithms which build a strong classifier from a set of weak classifiers, typically decision trees. Gradient boosting is one such algorithm; it builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function. The differentiable loss function in our case is the binomial deviance loss. The algorithm is implemented as follows, as described in (Friedman et al., 2001).
Input: training set (Xi, yi), i = 1, ..., n, with Xi ∈ H ⊆ Rn and yi ∈ [−1, 1]; a differentiable loss function L(y, F(X)), which in our case is the binomial deviance loss defined as log(1 + exp(−2yF(X))); and M, the number of iterations.
1. Initialize the model with a constant value:
   F0(X) = arg min_γ Σ_{i=1}^n L(yi, γ).
2. For m = 1 to M:
   (a) Compute the pseudo-responses:
       rim = −[ ∂L(yi, F(Xi)) / ∂F(Xi) ] evaluated at F(X) = Fm−1(X), for i = 1, . . . , n.
   (b) Fit a base learner hm(X) to the pseudo-responses, i.e. train it on the set {(Xi, rim)}, i = 1, . . . , n.
   (c) Compute the multiplier γm by solving the optimization problem:
       γm = arg min_γ Σ_{i=1}^n L(yi, Fm−1(Xi) + γ hm(Xi)).
   (d) Update the model: Fm(X) = Fm−1(X) + γm hm(X).
3. Output FM(X) = Σ_{m=1}^M γm hm(X).
The value of the weight γm is found by an approximate Newton-Raphson solution given as
   γm = Σ_{Xi ∈ hm} rim / Σ_{Xi ∈ hm} |rim| (2 − |rim|).
3.3.2 AdaBoost Classifier
In AdaBoost we assign (non-negative) weights to the points in the data set, normalized so that they form a distribution. In each iteration, we generate a training set by sampling from the data using the weights, i.e. the data point (Xi, yi) is chosen with probability wi, where wi is its current weight. We generate the training set by such repeated independent sampling. After learning the current classifier, we increase the (relative) weights of the data points that it misclassifies, generate a fresh training set using the modified weights, and so on. The final classifier is essentially a weighted majority vote of all the classifiers. The description of the algorithm as in (Freund et al., 1995) is given below:
Input: n examples (X1, y1), ..., (Xn, yn), Xi ∈ H ⊆ Rn, yi ∈ [−1, 1].
1. Initialize: wi(1) = 1/n for all i; each data point starts with equal weight, so when data points are sampled from this probability distribution each is equally likely to enter the training set.
2. Assume there are M classifiers within the ensemble.
   For m = 1 to M do
   (a) Generate a training set by sampling with weights wi(m).
   (b) Learn classifier hm using this training set.
   (c) Let ξm = Σ_{i=1}^n wi(m) I[yi ≠ hm(Xi)], where IA is the indicator function of A, defined as IA = 1 if yi ≠ hm(Xi) and IA = 0 otherwise; ξm is thus the weighted error of the m-th classifier.
   (d) Set αm = log((1 − ξm)/ξm), the hypothesis weight, so that αm > 0 under the assumption that ξm < 0.5.
   (e) Update the weight distribution over the training set as
       wi(m + 1) = wi(m) exp(αm I[yi ≠ hm(Xi)])
       and normalize the updated weights so that wi(m + 1) forms a distribution:
       wi(m + 1) = wi(m + 1) / Σ_i wi(m + 1).
   end for
3. Output: the final vote h(X) = sgn(Σ_{m=1}^M αm hm(X)), the weighted sum of all classifiers in the ensemble.
In the AdaBoost algorithm M is a parameter; because of the sampling with weights, the procedure can be continued for an arbitrary number of iterations. The loss function used in AdaBoost is the exponential loss, defined for a particular data point as exp(−yi f(Xi)).
3.3.3 Random Forest Classifiers
Random forests are a combination of tree predictors, such that each tree depends
on the values of a random vector sampled independently, and with the same dis-
tribution for all trees in the forest. The main difference between standard decision
trees and random forest is, in decision trees, each node is split using the best split
among all variables and in random forest, each node is split using the best among
a subset of predictors randomly chosen at that node. In the random forest classifier, ntree bootstrap samples are drawn from the original data, and for each bootstrap sample an unpruned classification decision tree is grown with the following modification: at each node, rather than choosing the best split among all predictors, mtry of the predictors are randomly sampled and the best split is chosen from among those variables. New data are predicted by aggregating the predictions of the ntree trees (i.e., majority vote for classification). The algorithm is described as follows, as in (Breiman, 2001):
Input: n examples D = {(X1, y1), ..., (Xn, yn)}, Xi ∈ Rn, where D is the whole dataset.
For i = 1, ..., B:
1. Choose a bootstrap sample Di from D.
2. Construct a decision tree Ti from the bootstrap sample Di such that at each node a random subset of m features is chosen and only splits on those features are considered.
Finally, given the test data Xt, take the majority vote of the trees for classification. Here B is the number of bootstrap data sets generated from the original data set D.
3.4 Multilayer Feed Forward with Back Propa-
gation using word embedding approach
In this approach a word embedding framework was used to convert words to vectors, followed by an MFN to classify the business event dataset. The Gensim module in Python was used to build the word embeddings, training the words with the CBOW (continuous bag-of-words) or skip-gram model of the unsupervised neural language model (Tomas Mikolov et al., 2013), where each word is assigned a uniformly distributed (U[-1,1]) 100- to 300-dimensional vector. Once vectors have been initialized for each word using the word embedding, a window-based approach converts the word vectors into a single global sentence vector. The obtained global sentence vector is fed into an MFN with back-propagation for classification of the sentences using a soft-max classifier. The algorithm is implemented as follows:
1. Initialize each word in a sentence with a uniformly distributed (U[-1,1]) dense vector of 100 to 300 dimensions.
2. From the set of words within a sentence, concatenate the word-embedding vectors to form a matrix for that particular sentence.
3. Choose an appropriate window size on the obtained matrix and apply max-pooling based on that window size to obtain a single global sentence vector.
4. The obtained global sentence vectors are fed into multilayer feed forward
network with back propagation using soft-max as the loss function. For
regularization of the multilayer feed forward network and to avoid overfitting
of the data, dropout mechanism is adopted.
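A simplified sketch of the embedding step with Gensim is given below; it uses the current Gensim API (the dimension argument was called size in older versions), trains CBOW vectors, and max-pools over all words of a sentence rather than over sliding windows, so it is only an approximation of the procedure described above. The input is assumed to be non-empty, tokenized, pre-processed sentences.

    import numpy as np
    from gensim.models import Word2Vec

    def sentence_vectors(tokenized_sentences, dim=100, window=5):
        # Train CBOW word embeddings (sg=1 would give the skip-gram model instead).
        model = Word2Vec(tokenized_sentences, vector_size=dim, window=window,
                         min_count=1, sg=0)
        vectors = []
        for sent in tokenized_sentences:
            word_vecs = np.array([model.wv[w] for w in sent])
            # Max-pooling over the word vectors gives one fixed-length sentence vector.
            vectors.append(word_vecs.max(axis=0))
        return np.array(vectors)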
3.5 Convolutional Neural Networks for Sentence
Classification with unsupervised feature vec-
tor learning
In this model a simple CNN is trained with one layer of convolution on top of word vectors obtained from an unsupervised neural language model (Yoon Kim, 2014). These vectors were trained by (Mikolov et al., 2013) on 100 billion words of Google news and are publicly available. The following figure (3.1) describes the architecture of the CNN for sentence modeling.
Figure 3.1: Architecture of the convolutional neural network for sentence modelling (multichannel architecture)
Let N be the number of sentences in the vocabulary and n the number of words in a particular sentence, and let xi ∈ Rk be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as

x1:n = x1 ⊕ x2 ⊕ ... ⊕ xn

where ⊕ is the concatenation operator. In general, let xi:i+j refer to the concatenation of words xi, xi+1, . . . , xi+j. The filter weight matrix w ∈ Rh×k is initialized with random uniformly distributed values. A convolution operation applies this filter to a window of h words of a particular sentence to produce a new feature; for example, a feature ci is generated from a window of words xi:i+h−1 by

ci = f(w · xi:i+h−1 + b)

where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. The filter is applied to each possible window of words in the sentence [x1:h, x2:h+1, ..., xn−h+1:n] to produce a feature map

c = [c1, c2, ..., cn−h+1], with c ∈ Rn−h+1.

We then apply a max-pooling operation over the feature map and take the maximum value c* = max[c] as the feature corresponding to this particular filter. The idea is to capture the most important feature, the one with the highest value, for each feature map. This pooling scheme naturally deals with variable sentence lengths. We have described the process by which one feature is extracted from one filter; the model uses multiple filters (with varying window sizes) to obtain multiple features. These features are also called unsupervised features, because they are obtained by applying different randomly initialized filters with varying window sizes. They form the penultimate layer and are passed to a fully connected soft-max layer whose output is the probability distribution over the labels.
To avoid overfitting of CNN models, drop-out mechanism is adopted.
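A minimal Keras sketch of the single-channel CNN-rand variant of this architecture is given below; the original work followed Kim (2014) and used multiple filter widths and pre-trained word2vec channels, so this single-filter-width model is only an illustrative assumption.

    from tensorflow.keras import layers, models

    def build_cnn(vocab_size, max_len, embed_dim=300, n_filters=100, window=3):
        # Word indices -> one row of word vectors per position in the sentence.
        inputs = layers.Input(shape=(max_len,))
        x = layers.Embedding(vocab_size, embed_dim)(inputs)
        # One convolution layer over windows of `window` words, followed by
        # max-pooling over time, as in the sentence model described above.
        x = layers.Conv1D(n_filters, window, activation="tanh")(x)
        x = layers.GlobalMaxPooling1D()(x)
        # Drop-out for regularization, as mentioned above.
        x = layers.Dropout(0.5)(x)
        # Fully connected soft-max layer over the two classes (event / non-event).
        outputs = layers.Dense(2, activation="softmax")(x)
        model = models.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model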
3.5.1 Variations in CNN sentence models
CNN-rand: Our baseline model where all words are randomly initialized and
then modified during training.
CNN-static: A model with pre-trained vectors from word2vec. All words in-
cluding the unknown ones that are randomly initialized are kept static and only
the other parameters of the model are learned. Initializing word vectors with those
obtained from an unsupervised neural language model is a popular method to im-
prove performance in the absence of a large supervised training set. We use the
publicly available word2vec vectors that were trained on 100 billion words from
Google news. The vectors have dimensionality of 300 and were trained using the
continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in
the set of pre-trained words are initialized randomly.
Chapter 4
Results and Discussions
In this chapter we discuss the results obtained from the machine learning algorithms that were applied in our work:
1. Semi-supervised learning approach using naive Bayes with expectation-maximization
and active learning with QBC to increase the number of labeled data points.
2. The ensemble classifiers, MFN and CNN models to classify the obtained
business data.
Described below are the results and analysis of the algorithms.
4.1 Semi-supervised Learning Implementation us-
ing Naive Bayes with Expectation Maximiza-
tion
Initially we had a few data points which were labeled in a supervised manner. To formulate and solve the problem as a business event classification problem, our primary objective was to increase the number of labeled data points.
In accordance with the algorithm of semi-supervised learning using the naive Bayes classifier with expectation maximization explained in section 3.1, the following are the results in the three domains of acquisition, vendor-supplier and job events, with the training data taken as 30%, 40% and 50% of the whole dataset and the rest of the pool as test data.
4.1.1 Results and Analysis of Vendor-Supplier Event Data
Vendor-supplier data points labeled in a supervised manner numbered 754. Stated below are some of the observations made on a large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.1) and figure (4.1) show the variation of accuracy and F-score for 30%, 40% and 50% of the data used for training, with the corresponding remaining part as test data. Figures (4.2), (4.4) and (4.6) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.3), (4.5) and (4.7) display the ROC curves for the same splits.
Analysis: We observe an increase in accuracy and F-score as the number of training data points increases, as expected. However, the accuracy increases by a larger amount than the F-score because true negatives considerably outnumber true positives. The confusion matrix plots show only slight variation in the numbers of true positives and true negatives as the training set grows. The ROC curves show an increase in TPR and in the area under the curve as the number of training data points increases.
Table 4.1: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for vendor-supplier data

    Training data (%)   Accuracy   F-score   Dataset description
    30                  0.5597     0.5915    Testing data = 527, training data = 227
    40                  0.7434     0.65      Testing data = 454, training data = 300
    50                  0.7765     0.674     Testing data = 376, training data = 376
Figure 4.1: Variations in Accuracies and F1-scores for Vendor-supplier data
using Naive-Bayes, semi-supervised technique
Figure 4.2: Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for VNSP
Figure 4.3: ROC curve for large pool of testing data of 70 percent and training data of 30 percent for VNSP
Figure 4.4: Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for VNSP
Figure 4.5: ROC curve for large pool of testing data of 60 percent and training data of 40 percent for VNSP
Figure 4.6: Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for VNSP
Figure 4.7: ROC curve for large pool of testing data of 50 percent and training data of 50 percent for VNSP
4.1.2 Results and Analysis for Job Event Data
Job event data points labeled in a supervised manner numbered 2810. Stated below are some of the observations made on a large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.2) and figure (4.8) show the variation of accuracy and F-score for 30%, 40% and 50% of the data used for training, with the corresponding remaining part as test data. Figures (4.9), (4.11) and (4.13) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.10), (4.12) and (4.14) display the ROC curves for the same splits.
Analysis: As the number of training data points increases, we observe an increase in accuracy and F-score. However, there is a large gap between accuracy and F-score because true negatives greatly outnumber true positives, which is clearly visible in the confusion matrix plots. The ROC curves show an increase in TPR and in the area under the curve as the number of training data points increases.
Table 4.2: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Job event data

    Training data (%)   Accuracy   F-score   Dataset description
    30                  0.7483     0.4444    Testing data = 1967, training data = 842
    40                  0.7544     0.4863    Testing data = 1686, training data = 1123
    50                  0.8014     0.52      Testing data = 1405, training data = 1404
Figure 4.8: Variations in Accuracies and F1-scores for Job event data using
Naive-Bayes, semi-supervised technique
Figure 4.9: Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for JOB
Figure 4.10: ROC curve for large pool of testing data of 70 percent and training data of 30 percent for JOB
Figure 4.11: Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for JOB
Figure 4.12: ROC curve for large pool of testing data of 60 percent and training data of 40 percent for JOB
Figure 4.13: Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for JOB
Figure 4.14: ROC curve for large pool of testing data of 50 percent and training data of 50 percent for JOB
4.1.3 Result and Analysis for Acquisition Event Data
Acquisition event data points labeled in a supervised manner numbered 1380. Stated below are some of the observations made on a large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.3) and figure (4.15) show the variation of accuracy and F-score for 30%, 40% and 50% of the data used for training, with the corresponding remaining part as test data. Figures (4.16), (4.18) and (4.20) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.17), (4.19) and (4.21) display the ROC curves for the same splits.
Analysis: Accuracy and F-score increase as the number of training data points increases, with the F-score increasing slightly faster than the accuracy. Because true positives outnumber true negatives, the classifier is somewhat biased towards the positive class, so the number of false positives is higher in this scenario, which is clearly visible in the confusion matrix plots. The ROC curves show an increase in TPR and in the area under the curve as the number of training data points increases.
Table 4.3: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for Acquisition event data

    Training data (%)   Accuracy   F-score   Dataset description
    30                  0.7929     0.8178    Testing data = 966, training data = 413
    40                  0.7989     0.82      Testing data = 828, training data = 521
    50                  0.8057     0.8241    Testing data = 689, training data = 690
Figure 4.15: Variations in Accuracies and F1-scores for Acquisition event data
using Naive-Bayes, semi-supervised technique
Figure 4.16: Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition
Figure 4.17: ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition
Figure 4.18: Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition
Figure 4.19: ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition
Figure 4.20: Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Acquisition
Figure 4.21: ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Acquisition
4.2 Active Learning implementation by Query
by committee approach
In accordance with the algorithm of active learning explained in section 3.2, following are some of the results in the three domains of acquisition, vendor-supplier and job events, with the training data taken as 30%, 40% and 50% of the whole dataset and the test data predicted by majority voting of three ensemble classifiers: the gradient boosting classifier, the AdaBoost classifier and the random forest classifier (i.e. the query by committee approach).
4.2.1 Results and Analysis for Vendor-Supplier Event Data
Vendor-supplier data points labeled in a supervised manner numbered 754. Following are some of the observations made on a large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.4) and figure (4.22) show the variation of accuracy and F-score for 30%, 40% and 50% of the data used for training, with the corresponding remaining part as test data. Figures (4.23), (4.25) and (4.27) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.24), (4.26) and (4.28) display the ROC curves for the same splits.
Analysis: We observe an increase in accuracy and F-score as the number of training data points increases. However, the accuracy increases by a larger amount than the F-score because true negatives outnumber true positives. This method performs better than the semi-supervised naive Bayes classifier. The confusion matrix plots show only slight variation in the numbers of true positives and true negatives as the training set grows. The ROC curves show an increase in TPR and in the area under the curve as the number of training data points increases.
Table 4.4: Variation in accuracies and F-scores using active learning (QBC) for Vendor-supplier event data

    Training data (%)   Accuracy   F-score   Dataset description
    30                  0.842      0.7348    Testing data = 529, training data = 225
    40                  0.84       0.7352    Testing data = 454, training data = 300
    50                  0.8643     0.76      Testing data = 376, training data = 376
Figure 4.22: Variations in Accuracies and F1-scores for Vendor-supplier data using Active learning
Figure 4.23: Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier
Figure 4.24: ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier
Figure 4.25: Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Vendor-Supplier
Figure 4.26: ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier
Figure 4.27: Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier
Figure 4.28: ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier
4.2.2 Result and Analysis for Job Event Data
Job event data points labeled in a supervised manner numbered 2809. Following are some of the observations made on a large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.5) and figure (4.29) show the variation of accuracy and F-score for 30%, 40% and 50% of the data used for training, with the corresponding remaining part as test data. Figures (4.30), (4.32) and (4.34) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.31), (4.33) and (4.35) display the ROC curves for the same splits.
Analysis: As the number of training data points increases, we observe an increase in accuracy and F-score. However, there is a large gap between accuracy and F-score because true negatives greatly outnumber true positives, which is clearly visible in the confusion matrix plots. The ROC curves show an increase in TPR and in the area under the curve as the number of training data points increases. The performance of this method is better than that of the semi-supervised naive Bayes classifier, as is clearly visible from our results.
Table 4.5: Variation in accuracies and F-scores using active learning (QBC) for Job event data

    Training data (%)   Accuracy   F-score   Dataset description
    30                  0.9054     0.6204    Testing data = 1967, training data = 842
    40                  0.9116     0.6558    Testing data = 1686, training data = 1123
    50                  0.9216     0.6758    Testing data = 1405, training data = 1404
Figure 4.29: Variations in Accuracies and F1-scores for Job event data using
Active learning
Figure 4.30: Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Job
Figure 4.31: ROC curve for large pool of testing data of 70 percent and training data of 30 percent for Job
Figure 4.32: Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Job
Figure 4.33: ROC curve for large pool of testing data of 60 percent and training data of 40 percent for Job
Figure 4.34: Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job
Figure 4.35: ROC curve for large pool of testing data of 50 percent and training data of 50 percent for Job
4.2.3 Result and Analysis for Acquisition Event Data
Acquisition event data points labeled in a supervised manner numbered 1380. Following are some of the observations made on a large pool of unlabeled test data, obtained by varying the split between training and test data. Table (4.6) and figure (4.36) show the variation of accuracy and F-score for 30%, 40% and 50% of the data used for training, with the corresponding remaining part as test data. Figures (4.37), (4.39) and (4.41) display the confusion matrices for these splits; the confusion matrix gives insight into the number of true positives, true negatives, false positives and false negatives. Figures (4.38), (4.40) and (4.42) display the ROC curves for the same splits.
Analysis: Accuracy and F-score increase as the number of training data points increases, and the increase in F-score is comparable to the increase in accuracy. The confusion matrix plots show that the numbers of true positives and true negatives are nearly equal. The ROC curves show an increase in TPR and in the area under the curve as the number of training data points increases. This method shows a slight improvement in accuracy compared with the semi-supervised naive Bayes classifier.
  • 6. v work. Last but not the least, I would like to thank my parents and my sister for their care, love and support throughout my life.
  • 8. Contents Acknowledgements iv List of Figures vi List of Tables ix List of Abbreviations xii 1 Introduction 1 1.1 Model Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Natural Language Processing . . . . . . . . . . . . . . . . . 3 Information Extraction and Retrieval: . . . . . . . . . 4 Named Entity Recognition: . . . . . . . . . . . . . . . 4 Parts Of Speech Tagging: . . . . . . . . . . . . . . . . 4 1.2.2 Text to Numeric Conversion . . . . . . . . . . . . . . . . . . 4 1.2.3 Data Labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.3.1 Semi-supervised Technique . . . . . . . . . . . . . . 5 1.2.3.2 Active Learning . . . . . . . . . . . . . . . . . . . . 6 Uncertainty sampling: . . . . . . . . . . . . . . . . . . 6 Query by the committee: . . . . . . . . . . . . . . . . 6 Expected model change: . . . . . . . . . . . . . . . . 7 Expected error reduction: . . . . . . . . . . . . . . . . 7 Variance reduction: . . . . . . . . . . . . . . . . . . . 7 1.2.4 Learning Classifiers . . . . . . . . . . . . . . . . . . . . . . . 7 1.2.4.1 Ensemble Classifiers . . . . . . . . . . . . . . . . . 7 Bagging: . . . . . . . . . . . . . . . . . . . . . . . . . 8 Boosting: . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.2.5 Convolutional Neural Network . . . . . . . . . . . . . . . . . 8 Convolutional Layer: . . . . . . . . . . . . . . . . . . 8 Activation Function: . . . . . . . . . . . . . . . . . . 9 Pooling layer: . . . . . . . . . . . . . . . . . . . . . . 9 Fully connected layer: . . . . . . . . . . . . . . . . . . 9 Loss layer: . . . . . . . . . . . . . . . . . . . . . . . . 9 1.2.6 Measures used for Analysing the Results: . . . . . . . . . . . 9 i
  • 9. Contents ii 1.3 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 Part-of-speech (POS) pattern of the phrase: . . . . . 11 Extraction of rhetorical signal features: . . . . . . . . 11 1.4 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.1 Second Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.2 Third Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 11 1.4.3 Fourth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.4.4 Fifth Chapter . . . . . . . . . . . . . . . . . . . . . . . . . . 13 1.5 Thesis Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2 Data Extraction,Data pre-processing and Feature Engineering 14 2.1 Crawling of Data from Web . . . . . . . . . . . . . . . . . . . . . . 14 2.2 Labeling of Extracted Data . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . 15 2.2.1.1 Acquisition Data Description . . . . . . . . . . . . 15 Acquisition event: . . . . . . . . . . . . . . . . . . . . 15 Non Acquisition event: . . . . . . . . . . . . . . . . . 15 2.2.1.2 Vendor-Supplier Data Description . . . . . . . . . . 15 Vendor-Supplier event: . . . . . . . . . . . . . . . . . 15 Non Vendor-Supplier event: . . . . . . . . . . . . . . 16 2.2.1.3 Job Data Description . . . . . . . . . . . . . . . . . 16 Job event: . . . . . . . . . . . . . . . . . . . . . . . . 16 Non Job event: . . . . . . . . . . . . . . . . . . . . . 16 2.2.2 Data Pre-processing . . . . . . . . . . . . . . . . . . . . . . 16 2.3 Feature Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Type 1 Features . . . . . . . . . . . . . . . . . . . . . . . . . 17 Noun, Noun-phrases and Proper nouns: . . . . . . . . 17 Example of Noun-phrase: . . . . . . . . . . . . . . . . 17 Word-Capital: . . . . . . . . . . . . . . . . . . . . . . 17 Example of Capital words: . . . . . . . . . . . . . . . 17 Parts of speech tag pattern: . . . . . . . . . . . . . . 17 Example of POS tag pattern Adj-Noun format: . . . 18 2.3.2 Type 2 Features . . . . . . . . . . . . . . . . . . . . . . . . . 18 Organization Name: . . . . . . . . . . . . . . . . . . . 18 Example of Organization names: . . . . . . . . . . . . 18 Organization references: . . . . . . . . . . . . . . . . 18 Examples of Organization references: . . . . . . . . . 18 Location: . . . . . . . . . . . . . . . . . . . . . . . . . 18 Example of location as feature . . . . . . . . . . . . . 18 Persons: . . . . . . . . . . . . . . . . . . . . . . . . . 18 Example of Persons: . . . . . . . . . . . . . . . . . . . 18 2.3.3 Type 3 Features . . . . . . . . . . . . . . . . . . . . . . . . . 19 Continuation: . . . . . . . . . . . . . . . . . . . . . . 19 Change of direction: . . . . . . . . . . . . . . . . . . . 19
  • 10. Contents iii Sequence: . . . . . . . . . . . . . . . . . . . . . . . . 19 Illustration: . . . . . . . . . . . . . . . . . . . . . . . 19 Emphasis: . . . . . . . . . . . . . . . . . . . . . . . . 19 Cause, condition or result : . . . . . . . . . . . . . . . 19 Spatial signals: . . . . . . . . . . . . . . . . . . . . . 19 Comparison or contrast: . . . . . . . . . . . . . . . . 19 Conclusion: . . . . . . . . . . . . . . . . . . . . . . . 19 Fuzz: . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Description of Vectorizers . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.1 Count Vectorizers . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4.1.1 Example of Count Vectorizer . . . . . . . . . . . . 20 2.4.2 Term Frequency and Inverse Document Frequency . . . . . . 21 2.4.2.1 Formulation of Term Frequency and Inverse Doc- ument Frequency . . . . . . . . . . . . . . . . . . . 21 Term-Frequency formulation: . . . . . . . . . . . . . . 21 Inverse Document Frequency formulation: . . . . . . . 21 2.4.2.2 Description of Combination of TF and IDF . . . . 22 2.4.2.3 Example of TF-IDF Vectorizer . . . . . . . . . . . 22 3 Machine Learning Algorithms Used For Analysis Of Business Event Recognition 24 3.1 Semi-supervised Learning using Naive Bayes Classifier with Expectation- Maximization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 24 3.2 Active Learning using Ensemble classifiers with QBC approach . . . 25 3.2.1 Query by committee . . . . . . . . . . . . . . . . . . . . . . 26 3.3 Ensemble Models for Classification of Business Events using Bag- Of-Words Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.1 Gradient Boosting Classifier . . . . . . . . . . . . . . . . . . 26 3.3.2 AdaBoost Classifier . . . . . . . . . . . . . . . . . . . . . . . 27 3.3.3 Random Forest Classifiers . . . . . . . . . . . . . . . . . . . 29 3.4 Multilayer Feed Forward with Back Propagation using word em- bedding approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.5 Convolutional Neural Networks for Sentence Classification with un- supervised feature vector learning . . . . . . . . . . . . . . . . . . . 30 3.5.1 Variations in CNN sentence models . . . . . . . . . . . . . . 32 CNN-rand: . . . . . . . . . . . . . . . . . . . . . . . . 32 CNN-static: . . . . . . . . . . . . . . . . . . . . . . . 32 4 Results and Discussions 34 4.1 Semi-supervised Learning Implementation using Naive Bayes with Expectation Maximization . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.1 Results and Analysis of Vendor-Supplier Event Data . . . . 35 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.1.2 Results and Analysis for Job Event Data . . . . . . . . . . . 40 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 40
  • 11. Contents iv 4.1.3 Result and Analysis for Acquisition Event Data . . . . . . . 45 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.2 Active Learning implementation by Query by committee approach . 50 4.2.1 Results and Analysis for Vendor-Supplier Event Data . . . . 50 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.2.2 Result and Analysis for Job Event Data . . . . . . . . . . . 55 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.3 Result and Analysis for Acquisition Event Data . . . . . . . 60 Analysis: . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.3 Comparison of Semi-supervised techniques and Active learning ap- proach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.4 Results of Ensemble Classifiers with different Parameter tuning . . 65 4.4.1 Analysis for vendor-supplier event Data using 100 estimators within the ensemble as the parameter . . . . . . . . . . . . . 65 4.4.2 Analysis for Job event Data using 100 estimators within the ensemble as the parameter . . . . . . . . . . . . . . . . . . . 68 4.4.3 Analysis for Acquisition event Data using 100 estimators within the ensemble as the parameter . . . . . . . . . . . . . 71 4.4.4 Analysis for Vendor-Supplier event Data using 500 estima- tors within the ensemble as the parameter . . . . . . . . . . 74 4.4.5 Analysis for Job event Data using 500 estimators within the ensemble as the parameter . . . . . . . . . . . . . . . . . . . 77 4.4.6 Analysis for Acquisition event Data using 500 estimators within the ensemble as the parameter . . . . . . . . . . . . . 80 4.5 Final Accuracies and F-score estimates for the model . . . . . . . . 83 4.5.1 Final Analysis of Vendor-Supplier Dataset . . . . . . . . . . 84 4.5.2 Final Analysis of Job Dataset . . . . . . . . . . . . . . . . . 85 4.5.3 Final Analysis of Acquisition Dataset . . . . . . . . . . . . . 87 4.6 Results obtained for MFN with Word Embedding . . . . . . . . . . 90 4.7 Results obtained for Convolutional Neural Networks . . . . . . . . . 90 4.7.1 Analysis for Vendor-Supplier Data using CNN-rand and CNN- word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 90 4.7.2 Analysis for Acquisition Data using CNN-rand and CNN- word2vec Model . . . . . . . . . . . . . . . . . . . . . . . . . 92 4.7.3 Analysis for Job using CNN-rand and CNN-word2vec Model 92 4.8 Result Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5 Conclusions and Future work 95 5.1 Challenges Encountered in Business Event Recognition . . . . . . . 95 5.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 5.3 Future works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
  • 14. List of Figures 3.1 The Image describes the architecture for Convolutional Neural Net- work with Sentence Modelling for multichannel architecture . . . . 31 4.1 Variations in Accuracies and F1-scores for Vendor-supplier data us- ing Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . 36 4.2 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for VNSP . . . . . . . . . . . . . . . . . 37 4.3 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 37 4.4 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for VNSP . . . . . . . . . . . . . . . . . 38 4.5 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 38 4.6 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for VNSP . . . . . . . . . . . . . . . . . 39 4.7 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for VNSP . . . . . . . . . . . . . . . . . . . . . . 39 4.8 Variations in Accuracies and F1-scores for Job event data using Naive-Bayes, semi-supervised technique . . . . . . . . . . . . . . . . 41 4.9 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for JOB . . . . . . . . . . . . . . . . . . 42 4.10 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 42 4.11 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for JOB . . . . . . . . . . . . . . . . . . 43 4.12 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 43 4.13 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for JOB . . . . . . . . . . . . . . . . . . 44 4.14 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for JOB . . . . . . . . . . . . . . . . . . . . . . . 44 4.15 Variations in Accuracies and F1-scores for Acquisition event data using Naive-Bayes, semi-supervised technique . . . . . . . . . . . . 46 4.16 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . 47 4.17 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 47 vii
  • 15. List of Figures viii 4.18 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . 48 4.19 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 48 4.20 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Acquisition . . . . . . . . . . . . . . 49 4.21 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Acquisition . . . . . . . . . . . . . . . . . . . 49 4.22 variations in Accuracies and F1-scores for Vendor-supplier data us- ing Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.23 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier . . . . . . . . . . . . 52 4.24 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier . . . . . . . . . . . . . . . . 52 4.25 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Vendor-Supplier . . . . . . . . . . . 53 4.26 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier . . . . . . . . . . . . . . . . 53 4.27 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier . . . . . . . . . . . . 54 4.28 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier . . . . . . . . . . . . . . . . 54 4.29 Variations in Accuracies and F1-scores for Job event data using Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.30 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Job . . . . . . . . . . . . . . . . . . 57 4.31 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 57 4.32 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Job . . . . . . . . . . . . . . . . . . 58 4.33 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 58 4.34 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 59 4.35 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 59 4.36 Variations in Accuracies and F1-scores for Acquisition event data using Active learning . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.37 Confusion matrix for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . 62 4.38 Roc curve for large pool of testing data of 70 percent and training data of 30 percent for Acquisition . . . . . . . . . . . . . . . . . . . 62 4.39 Confusion matrix for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . 63
  • 16. List of Figures ix 4.40 Roc curve for large pool of testing data of 60 percent and training data of 40 percent for Acquisition . . . . . . . . . . . . . . . . . . . 63 4.41 Confusion matrix for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . 64 4.42 Roc curve for large pool of testing data of 50 percent and training data of 50 percent for Job . . . . . . . . . . . . . . . . . . . . . . . 64 4.43 variations in Accuracies and F1-scores for Vendor-supplier data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 67 4.44 Confusion matrix for Vendor-supplier with number of estimators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.45 Roc curve for for Vendor-supplier with number of estimators as 100 68 4.46 Variations in Accuracies and F1-scores for Job data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.47 Confusion matrix for Job with number of estimators as 100 . . . . . 71 4.48 Roc curve for for Job with number of estimators as 100 . . . . . . . 71 4.49 Variations in Accuracies and F1-scores for Acquisition data for 5- fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 73 4.50 Confusion matrix for Acquisition with number of estimators as 100 74 4.51 Roc curve for for Acquisition with number of estimators as 100 . . . 74 4.52 Variations in Accuracies and F1-scores for Vendor-supplier data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . 76 4.53 Confusion matrix for Vendor-supplier with number of estimators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.54 Roc curve for for Vendor-supplier with number of estimators as 500 77 4.55 Variations in Accuracies and F1-scores for Job data for 5-fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . 79 4.56 Confusion matrix for Job with number of estimators as 500 . . . . . 80 4.57 Roc curve for for Job with number of estimators as 500 . . . . . . . 80 4.58 Variations in Accuracies and F1-scores for Acquisition data for 5- fold using 3 ensemble classifiers . . . . . . . . . . . . . . . . . . . . 82 4.59 Confusion matrix for Acquisition with number of estimators as 500 83 4.60 Roc curve for for Acquisition with number of estimators as 500 . . . 83 4.61 Variations in Accuracies and F1-scores for Vendor-supplier data for whole data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 4.62 Variations in Accuracies and F1-scores for Job data for whole data set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.63 Variations in Accuracies and F1-scores for Acquisition data 5-folds accuracy variations for whole data set . . . . . . . . . . . . . . . . . 89 4.64 CNN-rand and CNN-word2vec models for Vendor-supplier on whole data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 91 4.65 CNN-rand and CNN-word2vec models for Acquisition on whole data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . 93 4.66 CNN-rand and CNN-word2vec models for Job on whole data set with 3-folds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
  • 18. List of Tables 1.1 Recognition of Named-Event Passages in News Articles and its ap- plication to our work . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.1 The words and their counts in the sentence1 . . . . . . . . . . . . . 22 2.2 The words and their counts in the sentence2 . . . . . . . . . . . . . 22 4.1 Variation in accuracies and F-scores in Semi-supervised learning using naive Bayes for vendor-supplier data . . . . . . . . . . . . . . 36 4.2 Variation in accuracies and F-scores in Semi-supervised learning using naive Bayes for Job event data . . . . . . . . . . . . . . . . . 41 4.3 Variation in accuracies and F-scores in Semi-supervised learning using naive Bayes for Acquisition event data . . . . . . . . . . . . . 46 4.4 Variation in accuracies and F-scores using Active Learning for Vendor- supplier event data . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 4.5 Variation in accuracies and F-scores using Active Learning for Job event data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 4.6 Variation in accuracies and F-scores using Active Learning for Ac- quisition event data . . . . . . . . . . . . . . . . . . . . . . . . . . 61 4.7 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 100 in vendor-supplier data set 66 4.8 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 100 in vendor-supplier data . . . . 66 4.9 Variation in accuracies and F-scores for random forest classifier for number of parameter estimate as 100 in vendor-supplier data . . . . 66 4.10 Variation in test score for accuracy and F-score with Vendor-supplier data using voting of three ensemble classifiers with number of esti- mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.11 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 100 in Job data set . . . . . . 69 4.12 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 100 in Job data set . . . . . . . . 69 4.13 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 100 in Job data set . . . . . . . . 69 4.14 variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 100 70 4.15 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 100 in Acquisition data set . . 72 xi
  • 19. List of Tables xii 4.16 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 100 in Acquisition data set . . . . 72 4.17 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 100 in Acquisition data set . . . . 72 4.18 Variation in test score for accuracy and F-score with Acquisition data using voting of three ensemble classifiers with number of esti- mators as 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 4.19 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 500 in vendor-supplier data set 75 4.20 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 500 in vendor-supplier data set . . 75 4.21 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 500 in vendor-supplier data set . . 75 4.22 Variation in test score for accuracy and F-score with vendor-supplier data using voting of three ensemble classifiers with number of esti- mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.23 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 500 in Job data set . . . . . . 78 4.24 Variation in accuracies and F-scores for Ada Boosting classifier for number of parameter estimate as 500 in Job data set . . . . . . . . 78 4.25 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 500 in Job data set . . . . . . . . 78 4.26 variation in test score for accuracy and F-score with Job data using voting of three ensemble classifiers with number of estimators as 500 79 4.27 Variation in accuracies and F-scores for Gradient Boosting classifier for number of parameter estimate as 500 in Acquisition data set . . 81 4.28 Variation in accuracies and F-scores for Random forest classifier for number of parameter estimate as 500 in Acquisition data set . . . . 81 4.29 Variation in accuracies and F-scores for Ada boosting classifier for number of parameter estimate as 500 in Acquisition data set . . . . 81 4.30 Variation in test score for accuracy and F-score with Acquisition data using voting of three ensemble classifiers with number of esti- mators as 500 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 4.31 Variation in accuracies and F-scores for Gradient Boosting classifier for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . 84 4.32 Variation in accuracies and F-scores for Ada Boosting classifier for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 84 4.33 Variation in accuracies and F-scores for Random forest classifier for whole vendor-supplier data set . . . . . . . . . . . . . . . . . . . . . 85 4.34 Variation in accuracies and F-scores for Gradient Boosting classifier for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.35 Variation in accuracies and F-scores for Ada Boosting classifier for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 4.36 Variation in accuracies and F-scores for Random forest classifier for whole Job data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
  • 20. List of Tables xiii 4.37 Variation in accuracies and F-scores for Gradient Boosting classifier for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . 88 4.38 Variation in accuracies and F-scores for Random forest classifier for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 88 4.39 Variation in accuracies and F-scores for Ada boosting classifier for whole Acquisition data set . . . . . . . . . . . . . . . . . . . . . . . 89 4.40 Variation in test score for MFN with word embedding . . . . . . . . 90 4.41 Variation in accuracies and F-scores CNN-rand and CNN-word2vec models for Vendor-supplier on whole data set . . . . . . . . . . . . . 91 4.42 Variation in accuracies and F-scores CNN-rand and CNN-word2vec models for Acquisition on whole data set . . . . . . . . . . . . . . . 92 4.43 Variation in accuracies and F-scores CNN-rand and CNN-word2vec models for Job on whole data set . . . . . . . . . . . . . . . . . . . 93
  • 22. List of Abbreviations
POS - Parts of speech
NLTK - Natural Language Tool Kit
QBC - Query By Committee
NLP - Natural language processing
IE - Information Extraction
IR - Information Retrieval
NER - Named entity recognizer
ML - Machine Learning
CNN - Convolutional Neural network
MFN - Multilayer feed forward network
TF - Term Frequency
IDF - Inverse Document Frequency
CBOW - Continuous bag of words
ROC - Receiver operator characteristic
TPR - True Positive Rate
FPR - False Positive Rate
TP - True Positives
FP - False Positives
TN - True Negatives
FN - False Negatives
  • 24. Chapter 1 Introduction Textual information present in the web is unstructured and extracting useful infor- mation from it for a specific purpose is tedious and challenging. So over the years various methods have been proposed for extraction of useful text. Text mining is the domain that deals with the process of deriving high-quality information from unstructured text. The goal of text mining is essentially to convert unstructured text into structured data and there by extracting some useful information by ap- plying techniques of natural language processing (NLP) and pattern recognition. The concept of manual text mining was first introduced in mid-1980’s (Hobbs et al., 1982). Over the past decade technological advancements in this field have been significant with building of automated approaches for extraction and analysis of text. Text mining is composed of five major components: information retrieval, data mining, machine learning, statistics and computational linguistics. The application of text mining are in the various domains which includes: (a) Named entity recognition which deals with identification of named text features such as people, organization and location(sang et al., 2003). (b) Recognition of pattern identified entities which deals with extraction of features such as telephone numbers, e-mail address and built-in database quantities that can be discerned using regular expression or other pattern matches(Nadeau et al., 2007). (c) Co- reference deals with the identification of noun phrases and other terms that refer to these nouns eg: such as her, him, it and their(Soon et al., 2001). (d) Sentiment analysis which includes extracting various forms of users intent information such 1
  • 25. 2 as sentiment, opinion, mood and emotion. Text analytics techniques are helpful in the analysis of sentiment at different topics level(pang et al., 2008). (e) Spam detection which deals with the classification of e-mail as spam or not, based on application of statistical machine learning and text mining techniques.(Rowe et al., 2007) (f) News analytics which deals with extraction of vital news or information content of an interest to the end user. (h) Business event recognition from online news articles. Business Event Recognition From Online News Articles captures semantic signals and identifies pattern from unstructured text to extract business events in three main domains i.e. acquisition, vendor-supplier and job events from online news articles. Acquisition business event news pattern in general is of the context organization acquiring another organization. The keywords used in acquisition business events scenario are acquire, buy, sell, sold, bought, take-over, purchase and merger. Vendor-supplier business event news pattern in general is of the context organization obtaining a contract from another organization to perform certain task for that organization. The keywords used in vendor-supplier business event scenario are contract, procure, sign, implement, select, award, work, agreement, deploy, provide, team, collaborate, deliver and joint. Job business event news pattern in general is of the context of appointments of persons to prominent positions, hiring and firing of people within an organizations. Our thesis deals with the development of an automated model for busi- ness event recognition from online news articles. For developing the automated model of Business Event Recognition From Online News Articles, data has been crawled from different websites such as reutersnews, businesswireindia.com and prnewswire.com. Since the manual labeling of the data was expensive, the gath- ered data was subjected to semi-supervised learning techniques and active learn- ing methods for getting more tagged event data in the domains of acquisition, vendor-supplier and job. Then the obtained tagged data was pre-processed using natural language processing techniques. Further on, for the conversion of text to numerics the bag-of-words, word-embedding and word2vec approaches were
  • 26. 3 used. Final analysis on the business event dataset was performed using ensem- ble classifiers with bag-of-words approach and convolutional neural network with word-embedding, word2vec approach. 1.1 Model Architecture Given a set of online articles or documents which is of interest from the end user, our developed automated model must predict the class output as whether the given sentence contains business event related to acquisition, vendor-supplier and job events. If the automated model predicts a sentence as a business event then it has to give out additional information regarding the description of the event such as entities involved in that particular event like organizations and people. Provid- ing such additional information helps the end user to make better decisions with quicker insights. On daily basis around the world business events are happening. An orga- nization as a competitor would like to understand the business analytics of other organizations. The development of an automated approach for identifying such business events helps in better decision making, increases efficiency and helps to develop better business strategies for that organization. 1.2 Methods Given below sections are the methods used for our work. 1.2.1 Natural Language Processing The concept of information extraction and information retrieval in our work deals with extraction and retrieval of the business news containing the business event sentences from the online news articles. The concepts of part-of-speech (POS) tagging and named entity recognition (NER) are used as part of feature engi- neering in our work. The pattern of POS tagging is essential in extracting useful
  • 27. semantic features, and NER is useful in extracting entity-type features such as organizations, persons and locations, which form an integral part of any business event. The framework for our project is formed by the concepts of information extraction (IE) and information retrieval (IR). Discussed below are information extraction and retrieval, named entity recognition (NER) and parts-of-speech (POS) tagging, which form the baseline for the natural language processing techniques implemented in our work (Liddy, 2001).
Information Extraction and Retrieval: information extraction and retrieval deal with searching for the required text, extracting semantic information from it, and storing the retrieved information in a particular form in a database.
Named Entity Recognition: named entity recognition deals with extracting, from a text document, the set of people, places and type-based entities (such as organizations) that it mentions.
Parts Of Speech Tagging: the pattern of POS tags forms an important set of features for any NLP-related task; extraction of proper semantic features is possible with the pattern of POS tags.
1.2.2 Text to Numeric Conversion
The conversion of words to vectors was implemented using the bag-of-words and word-embedding approaches. An overview of these concepts, followed by a short vectorization sketch, is given below.
In the bag-of-words approach, a piece of text or a sentence of a document is represented as the bag (multiset) of its words, disregarding the grammar and the word order but keeping the multiplicity of the words intact (Harris, 1954). Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing in which the words of a sentence are mapped to vectors of real numbers in a low-dimensional space, relative to the vocabulary size (Tomas Mikolov et al., 2013).
One of the major disadvantages of the bag-of-words approach is that it fails to capture the semantics of a particular word within a sentence, because it converts words to vectors disregarding the grammar and the order. Consider the following sentence, where the bag-of-words approach fails: "After drawing money from the Bank Ravi went to the river Bank." In the bag-of-words representation there is no distinction between the financial "Bank" and the river "Bank". This problem of capturing the semantics of a word is to a certain extent overcome by word embedding, in which each word is represented by a 100- to 300-dimensional dense vector drawn uniformly at random (i.e. U[-1,1]). Word embedding with a window approach captures semantics to a certain extent.
1.2.3 Data Labeling
The extracted data points labeled in a supervised manner were few in number. The sections below describe the semi-supervised technique and the active learning methods used to obtain more labeled data.
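As a concrete illustration of the bag-of-words conversion just described (the vectorization sketch referred to above), the example below vectorizes two toy sentences, including the ambiguous "Bank" sentence. This is illustrative only: scikit-learn's CountVectorizer is assumed here, and the thesis does not tie itself to this exact implementation.

```python
# Minimal bag-of-words sketch (scikit-learn assumed; toy sentences, not thesis data).
from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "After drawing money from the Bank Ravi went to the river Bank",
    "IBM announced a definitive agreement to acquire Silverpop",
]

vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(sentences)

print(sorted(vectorizer.vocabulary_))  # one column per distinct word
print(X.toarray())                     # term counts per sentence

# Note that "bank" occupies a single column with count 2 in the first sentence:
# the financial sense and the river sense are indistinguishable, which is the
# limitation of the bag-of-words representation discussed above.
```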
  • 28. 5 vectors disregarding the grammar and the order. Consider the following sentences where the bag-of-words approach fails. After drawing money from the Bank Ravi went to the river Bank. In the bag-of-words approach there is no distinction between financial word Bank and river Bank. This problem of capturing semantics of the word to a certain extent overcome by word-embedding. In word embedding each word is represented by a 100 to 300 dimensional uniformly distributed (i.e U[-1,1])random dense vector. Word-embedding with window approach captures semantics to certain extent. 1.2.3 Data Labeling The extracted data labeled in supervised manner were few in number. The sections below describe the semi-supervised technique and active learning methods. 1.2.3.1 Semi-supervised Technique The naive Bayes classifier forms the integral part in the implementation of semi- supervised learning using naive Bayes classifier with expectation maximization to increase the number of labeled data points (kamalNigam et al.,2006). Discussed below is an overview of the naive Bayes classifier. Naive Bayes classifiers are probabilistic classifiers which use the concept of Bayes theorem. In naive Bayes classifier assumption is made that one feature is condi- tionally independent from another feature. The modeling of a naive Bayes classifier is described as follows: Given a input feature vector x=(x1, x2, ...., xn)T we need to calculate which class does this feature vector belong to i.e. p(Yk|x1, x2, ...xn) for each k classes, where Yk is the output variable for the kth class. Now using the concept of the Bayes theorem we can rewrite the above probability expression as: p(Yk|x) = p(Yk)p(x|Yk) p(x) where p(Yk) = are prior probabilities for that particular class p(x|Yk) = is the maximum likelihood estimator
  • 29. 6 p(x) = is the probability of choosing that particular data point The naive Bayes classifier framework uses the maximum posteriori rule, to pick the output which is most probable output for that particular class. Maximum posteriori probabilities = Prior× Maximum likelihood Naive Bayes classifier assigns a label ˆy = Yk based on MAP rule, and classifier prediction is given as follows. ˆy = argmax k∈{1,...,K} p(Yk) n i=1 p(xi|Yk). In text mining the classifier used is Multinomial Naive Bayes classifier with the bag-of-words approach. 1.2.3.2 Active Learning Active learning with query by committee approach using ensemble classifiers was implemented as part of our work to increase the number of labeled data(Abe and Mamitsuka, 1998). Discussed below is the concept of active learning. Active learning is a special case of semi-supervised machine learning in which a learning algorithm is able to interactively query the user (or some other informa- tion source) to obtain the desired outputs at new data points. There are situations in which unlabeled data is abundant but manually labeling is expensive. In such a scenario, learning algorithms can actively query the user for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples to learn a concept can often be much lower than the number required in normal supervised learning. The following discussed are query strategies for querying most informative data points in active learning. Uncertainty sampling: Uncertainty Sampling deals with labeling of those points for which current model is least certain about or for which labeled data point en- tropy value is maximum, by querying with the user. Query by the committee: A combination of classifiers are trained on the current labeled data points. Finally take the vote on the predicted labels of the
  • 30. 7 classifiers and query the labels by the user for labels which the classifiers disagree the most. Expected model change: Labeling of the data points which would result in drastic change in the current model. Expected error reduction: Labeling of those points which would reduce the most current model’s generalization error. Variance reduction: Labeling of those points which minimizes the output vari- ance of the current model the most, which are the points near by to the marginal hyper-plane in SVM. 1.2.4 Learning Classifiers The classifiers used in our work were ensemble classifiers and convolutional neural networks(CNN). The sections below describe the basic overview of concepts that are required to understand the ensembles methods and CNN that was implemented as in our work. 1.2.4.1 Ensemble Classifiers Random forest classifier implemented in our work (Breiman, 2001) is derived from the concept of bootstrap aggregation technique. Gradient boosting classifier (Friedman et al., 2001) and ada boost classifier (Freund et al., 1995) implemented in our work are derived from the boosting algorithm technique. Discussed below are concepts of ensembles with bagging and boosting. Ensembles are the concept of combining classifiers by which the performance of the combined classifier on the model is increased compared to the performance of each individual classifier on the model. There two different kinds of ensemble methods in practice, one is bagging also called as bootstrap aggregation and the other method is boosting.
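Before bagging and boosting are described in detail, the semi-supervised labeling loop of Section 1.2.3.1 (naive Bayes refined with an EM-style iteration) can be sketched as follows. This is a simplified, hard-assignment version and assumes scikit-learn; the soft-EM variant actually used is described in Chapter 3, and the sentences and labels below are toy placeholders, not the thesis data.

```python
# Hard-assignment sketch of the naive Bayes + EM idea (Nigam et al.):
# train on the labeled pool, pseudo-label the unlabeled pool, refit, repeat.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labelled_text = ["IBM announced an agreement to acquire Silverpop",
                 "Carlyle invests across four segments worldwide"]
labels = np.array([1, 0])                     # 1 = acquisition event, 0 = not
unlabelled_text = ["Google buys Nest for 3.2 billion dollars",
                   "The company reported quarterly revenue growth"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labelled_text)
X_unl = vec.transform(unlabelled_text)

clf = MultinomialNB().fit(X_lab, labels)      # initialise on labeled data only
for _ in range(10):                           # EM-style refinement
    pseudo = clf.predict(X_unl)               # E-step: label the unlabeled pool
    X_all = np.vstack([X_lab.toarray(), X_unl.toarray()])
    y_all = np.concatenate([labels, pseudo])
    clf = MultinomialNB().fit(X_all, y_all)   # M-step: refit on all data

print(clf.predict(X_unl))                     # pseudo-labels after refinement
```

The same loop can instead be driven by committee disagreement, as in the query-by-committee strategy above, when a human labeler is available to resolve the disputed points.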
  • 31. Bagging: In bagging, a single classifier is learnt from a subset of the training data at each instance. From a training set of size M, it is possible to draw M random instances using a uniform distribution. The M samples drawn at each instance are learnt by a classifier, and this process is repeated several times. Since the sampling is done with replacement, some data points may be picked twice and others not at all within each subset of the original training dataset. A classifier is learnt on such a subset of the training data in each cycle, and the final prediction is made by taking a vote of the classifiers over the different generated datasets.
Boosting: In boosting, a single classifier or different classifiers are learnt using a subset of the data at each instance. The boosting technique analyses the performance of the classifier learnt at each instance and forces it to concentrate on the training instances that were incorrectly classified. Instead of choosing the M training instances randomly using a uniform distribution, one chooses the training instances in such a manner as to favour the instances that have not been accurately learnt by the classifier. The final prediction is performed by taking a weighted vote of the classifiers learnt at the various instances.
1.2.5 Convolutional Neural Network
A convolutional neural network for sentence modelling, trained with a softmax classifier, was implemented in our work (Yoon Kim, 2014). An overview of a generalized convolutional neural network and the softmax classifier is given below. A convolutional neural network is a type of feed-forward neural network whose architecture consists of four main layers: the convolutional layer, the pooling layer, the fully connected layer and the loss layer. Stacking these layers forms the full conv-net architecture.
Convolutional Layer: conventional convolution with Sobel or Prewitt filters on an image is useful for detecting features of the image such as edges and
  • 32. corners; in a convolutional neural network, by contrast, the parameters of each convolution kernel (each filter) are trained by the back-propagation algorithm. There are many convolution kernels in each layer, and each kernel is replicated over the entire image with the same parameters. The function of the convolution operators is to extract different features of the input.
Activation Function: the activation functions used in convolutional neural networks are the hyperbolic tangent f(x) = tanh(x), the ReLU f(x) = max(0, x) and the sigmoid f(x) = 1 / (1 + exp(-x)).
Pooling layer: this layer captures the most important feature by performing the max operation on the obtained feature map vector. All such max features together form the penultimate layer.
Fully connected layer: finally, after several convolutional and max pooling layers, the high-level reasoning in the network is done via fully connected layers. A fully connected layer takes all neurons in the previous layer (be it fully connected, pooling, or convolutional) and connects each of them to every one of its own neurons. Fully connected layers are no longer spatially located (they can be visualized as one-dimensional), so there can be no convolutional layers after a fully connected layer.
Loss layer: after the fully connected layer, a softmax classifier is present at the output layer with a softmax loss function, to predict probabilistic labels. The softmax classifier is obtained from the softmax function: for a sample input vector x, the predicted probability of the jth class among K classes is given as
P(y = j|x) = exp(x^T wj) / Σ_{k=1}^{K} exp(x^T wk)
1.2.6 Measures used for Analysing the Results:
The performance measures used for our results and analysis are described as follows (Powers et al., 2007).
  • 33. 1. F-score: the F-score is a measure used in information retrieval for measuring sentence classification performance, since it takes only the true positives into account and not the true negatives. The F-score is defined as
F1 = 2×TP / (2×TP + FP + FN)
2. Confusion matrix: the performance of any classification algorithm can be visualized by a specific table layout called the confusion matrix. Each column of the confusion matrix represents the instances in a predicted class, while each row represents the instances in an actual class.
3. ROC curve: this is a plot of TPR against FPR. The TPR describes the proportion of true positive results among all positive samples, and the FPR describes the proportion of incorrect positive results among all negative samples. The area under the ROC curve is a measure of accuracy.
4. Accuracy: the accuracy of a classification problem is defined as
accuracy = (TP + TN) / (P + N)
1.3 Related Works
The paper closest to our work on Business Event Recognition From Online News Articles is Recognition of Named-Event Passages in News Articles (Luis Marujo et al., 2012). That paper describes a method for finding named events in the violent-behaviour and business domains, in the specific passages of news articles that contain information about such events, and reports preliminary evaluation results using NLP techniques and ML algorithms. Table 1.1 summarizes that paper and its application to our work. As part of the feature engineering used in our work, we have adopted some of the feature engineering techniques of (Luis Marujo et al., 2012). The following features from that paper are used in our work.
  • 34. 11 Part-of-speech (POS) pattern of the phrase: (e.g., < noun >, < adj, noun > , < adj, adj, noun >, etc.) Noun and noun phrases are the most common pattern observed in key phrases containing named events, verb and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare. Extraction of rhetorical signal features: These are set of features which capture the readers attention in News Events which are continuation, change of direction, sequence, illustration, emphasis, cause, condition, result, spatial-signals, comparison/contrast, conclusion and fuzz. 1.4 Thesis Outline Second chapter deals with the extraction and understanding of business event data, third chapter deals with the application of machine-learning algorithms on obtained data, fourth chapter deals with results and analysis on the business event datasets and finally fifth chapter deals with conclusion of our work. 1.4.1 Second Chapter This chapter deals with extraction of business event data from web, followed by pre-processing of the data. Application of feature engineering on the obtained data and finally converting the data into vectors for applying machine-learning algorithms. 1.4.2 Third Chapter This chapter deals with applying semi-supervised techniques on the data to in- crease the number of data points and understanding of algorithms of different ensemble classifiers and CNN(convolutional neural network).
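The ensemble classifiers mentioned in the chapter outline above, which Chapter 4 combines by voting over gradient boosting, AdaBoost and random forest, can be sketched as follows. This is illustrative only: scikit-learn is assumed, and the synthetic data stands in for the vectorized event sentences and labels used in the thesis.

```python
# Voting ensemble of the three classifiers analysed in Chapter 4 (sketch only).
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the vectorized business event data.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("gb", GradientBoostingClassifier(n_estimators=100)),
        ("ada", AdaBoostClassifier(n_estimators=100)),
        ("rf", RandomForestClassifier(n_estimators=100)),
    ],
    voting="hard",            # majority vote over the three predictions
)

ensemble.fit(X_train, y_train)
print(ensemble.score(X_test, y_test))   # accuracy of the combined classifier
```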
  • 35. Table 1.1: Recognition of Named-Event Passages in News Articles and its application to our work

Recognition of named-event passages in news articles (Luis Marujo et al., 2012):
1. Deals with automatically identifying multi-sentence passages in a news article that describe named events. Specifically, the paper focuses on ten event types: five in the violent behaviour domain (terrorism, suicide bombing, sex abuse, armed clashes and street protests) and five in the business domain (management changes, mergers and acquisitions, strikes, legal troubles and bankruptcy).
2. The problem is solved as a multiclass classification problem for which the training data was obtained by crowd-sourcing on Amazon Mechanical Turk, labeling the data points as events or non-events. Ensemble classifiers are then used to classify these sentences for each event, and finally the passages containing the same events are aggregated using HMM methods.

Business event recognition from online news articles (this work):
1. Our work, derived from Recognition of Named-Event Passages in News Articles, focuses exclusively on identifying business events in the domains of merger and acquisition, vendor-supplier and job events.
2. The problem in our case is solved as binary classification for each of the three domains (merger and acquisition, vendor-supplier and job), i.e. that particular event or not. The procedure also differs: we label a few data points in a supervised way and then apply semi-supervised techniques to increase the number of labeled data points, finally applying ensemble classifiers and convolutional neural networks to classify the labeled data points.
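The convolutional neural network for sentence classification referred to in the table (Section 1.2.5, detailed in Chapter 3) can be sketched roughly as below. Keras is assumed purely for illustration and is not necessarily the framework used in the thesis; a single filter width and a sigmoid output are used here for brevity, whereas the model described above uses several windows and a softmax loss layer (equivalent for the binary case). All sizes are arbitrary toy values.

```python
# Rough sketch of a CNN sentence classifier in the spirit of Kim (2014).
import numpy as np
from tensorflow.keras import layers, models

vocab_size, embed_dim, max_len = 5000, 100, 50   # assumed toy hyper-parameters

model = models.Sequential([
    layers.Embedding(vocab_size, embed_dim),                        # word embeddings
    layers.Conv1D(filters=100, kernel_size=3, activation="relu"),   # convolution over word windows
    layers.GlobalMaxPooling1D(),                                    # max-over-time pooling
    layers.Dense(100, activation="relu"),                           # fully connected layer
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),                          # event / non-event output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Toy data: each row is a sentence encoded as word indices, padded to max_len.
X = np.random.randint(1, vocab_size, size=(32, max_len))
y = np.random.randint(0, 2, size=(32,))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X[:2]))
```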
  • 36. 13 1.4.3 Fourth Chapter This chapter deals with results and analysis of applied machine-learning techniques which includes semi-supervised learning analysis, ensemble classifier analysis and analysis of convolutional neural networks. 1.4.4 Fifth Chapter This chapter deals with challenges encountered while performing the project, con- clusion of the project and future scope of the project. 1.5 Thesis Contribution Our work focuses on business event recognition in three domains: acquisition, vendor-supplier and job. This whole process of identifying the business event news exclusively in these three domains using the knowledge of machine learning and NLP techniques is main contribution of our work.
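The results chapters report accuracies, F-scores, confusion matrices and ROC curves; the measures defined in Section 1.2.6 can be computed directly as in the sketch below. scikit-learn is assumed, and the numbers are toy values rather than results from this work.

```python
# Computing the evaluation measures of Section 1.2.6 (illustrative values only).
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score, roc_curve)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                      # actual classes
y_score = np.array([0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1])     # classifier scores
y_pred = (y_score >= 0.5).astype(int)                            # thresholded predictions

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(accuracy_score(y_true, y_pred))     # (TP + TN) / (P + N)
print(f1_score(y_true, y_pred))           # 2*TP / (2*TP + FP + FN)

fpr, tpr, _ = roc_curve(y_true, y_score)  # points of the ROC curve
print(roc_auc_score(y_true, y_score))     # area under the ROC curve
```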
  • 38. Chapter 2 Data Extraction, Data Pre-processing and Feature Engineering
The initial step in business event recognition is extracting the business news and labeling a few of the extracted data points, so that the task can be formulated as a machine learning problem. The method of extracting data from the web and labeling some of the extracted data is described in the following sections.
2.1 Crawling of Data from Web
There are several methods to crawl data from the web; one such method is described in this section. Every website has its own HTML structure, so separate crawling logic had to be written to extract text data from the different websites. The modules used for data extraction in Python are Beautiful Soup and urllib. For our study, information is extracted from particular websites such as businesswireindia.com, prnewswire.com and Reuters news. Python is the language framework used in our work. The urllib module in Python is used to fetch the particular set of pages that have to be accessed on the web. The Beautiful Soup module in Python uses the HTML structure and finds the contents present within each page in the format of the title, subtitle and the description corresponding
  • 39. 16 to each content block by block. Finally the extracted title, subtitle and body contents are stored in the text-file formats. 2.2 Labeling of Extracted Data Since the business events are in the form of sentences, the text document obtained as raw text as part of web crawling, is split up into sentences using a natural language processing toolkit(NLTK) sentence tokenizer. Some of the sentences were labeled into three classes: merger and acquisition, vendor-supplier and job describing whether it is a business event or not. 2.2.1 Data Description Stated below is an illustration of data describing business event or not a business event in three classes of acquisition, vendor-supplier and job. 2.2.1.1 Acquisition Data Description Acquisition event: ARMONK, N.Y., April 10, 2014 /PRNewswire/ – IBM (NYSE: IBM) today announced a definitive agreement to acquire Silverpop, a privately held software company based in Atlanta, GA. Non Acquisition event: : Carlyle invests across four segments Corporate Pri- vate Equity Real Assets Global Market Strategies and Solutions in Africa Asia Australia Europe the Middle East North America and South America. 2.2.1.2 Vendor-Supplier Data Description Vendor-Supplier event: : Tri-State signs agreement with NextEra Energy Re- sources for new wind facility in eastern Colorado under the Director Jack stone; WESTMINSTER, Colo., Feb. 5, 2014 /PRNewswire/ – Tri-State Generation and Transmission Association, Inc. announced that it has entered into a 25-year agree- ment with a subsidiary of NextEra Energy Resources, LLC for a 150 megawatt
  • 40. 17 wind power generating facility to be constructed in eastern Colorado,in the ser- vice territory of Tri-State member cooperative K. C. Electric Association (Hugo, Colo.). Non Vendor-Supplier event: The implementation of the DebMed GMS elec- tronic hand hygiene monitoring system is a clear demonstration of Meadows Re- gional Medical Center’s commitment to patient safety, and we are excited to partner with such a forward-thinking organization that is focused on providing a state-of-the-art patient environment, said Heather McLarney, vice president of marketing, DebMed. 2.2.1.3 Job Data Description Job event: In a note to investors, analysts at FBR Capital Markets said the appointment of Nadella as Director of the company was a ”safe pick” compared to choosing an outsider. Non Job event: This partnership is an example of steps we are taking to sim- plify and improve the Tactile Medical order process, said Cathy Gendreau,Business Director. 2.2.2 Data Pre-processing The extracted business event sentences as raw text as part of data extraction was cleansed by removing of special characters and stop-words which include words like the, and, an etc. The stopwords are common between positive class and the negative class, and hence to enhance the difference between positive class and negative class we had to remove them. NLTK module in python was used for the above pre-processing of the data. 2.3 Feature Engineering To build hand crafted features, we had to observe the extracted unstructured data and recognize pattern, so that useful features could be extracted. The features
  • 41. 2.3 Feature Engineering To build hand-crafted features, we had to observe the extracted unstructured data and recognize patterns so that useful features could be extracted. The features extracted are described below, and examples for the corresponding features are taken with reference to the vendor-supplier event in (2.2.1.2). 2.3.1 Type 1 Features Shallow semantic features record the pattern and semantics of the data and consist of the following features (Luismaurijo et al., 2012). Noun, noun phrases and proper nouns: Entities form an integral part of business event sentences, so noun phrases and proper nouns are common in sentences containing business events. Using the NLTK parts-of-speech tagger, noun phrases, and correspondingly nouns and proper nouns, were extracted from each sentence. Example of noun phrases: Title agreement Next Era Energy wind facility eastern Colorado WESTMINSTER Colo. Feb. Generation Transmission Association Inc. agreement subsidiary NextEra Energy LLC megawatt wind power facility eastern Colorado service territory member K. C. Electric Association Hugo Colo. Word-Capital: If a capital letter is present in a sentence containing a business event, there is a higher chance of organizations, locations and persons being present in the sentence; these in turn are entity-like features which enhance event recognition. Example of capital words: WESTMINSTER, LLC, K.C. Here WESTMINSTER is a location and K.C. is an organization, illustrating the entity features obtained from the Word-Capital feature. Parts-of-speech tag pattern: Patterns of parts-of-speech tags such as adjective-noun (a noun preceded by an adjective) and adjective-adjective-noun (a noun preceded by two adjectives) are good features for event recognition. Adjectives are used to describe a noun, so there is a higher chance of finding such patterns in business event sentences. Nouns and noun phrases are the most common
  • 42. patterns observed in key phrases of business event sentences; verbs and verb phrases are less frequent, and key phrases made of the remaining POS tags are rare. Example of the adjective-noun POS tag pattern: new wind, 25-year agreement, Tri-State member; here, for instance, 25-year is the adjective and agreement the noun. 2.3.2 Type 2 Features Entity-type features capture the entities present in the business event sentence. Some of these features are described below. Organization name: Organization names are usually present in sentences containing business events and often give additional insight as features for event recognition. Example of organization names: Tri-State, Tri-State Generation and Transmission Association, NextEra Energy Resources. Organization references: References to organization entities present in the business event sentences are taken as features. Example of organization references: K. C. Electric Association. Location: Location is an important entity-describing feature, giving more insight into the description of business events. Example of location as a feature: WESTMINSTER, Colo., Colorado. Persons: There is a higher chance of a person or a group of people being present in sentences that contain business events, so persons are used as features to enhance business event recognition. Example of persons: Jack stone.
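The sketch below shows how some of the Type 1 and Type 2 features described above (nouns and proper nouns, capitalized words, adjective-noun patterns, and organization/location/person entities) might be extracted with NLTK; the sentence is the vendor-supplier example of (2.2.1.2), and the NLTK tagger and named-entity chunker resources are assumed to be installed.

    # Sketch: shallow semantic (Type 1) and entity-type (Type 2) features with NLTK.
    import nltk

    sentence = ("Tri-State signs agreement with NextEra Energy Resources "
                "for new wind facility in eastern Colorado")
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)

    nouns = [w for w, t in tagged if t.startswith("NN")]        # nouns and proper nouns
    capital_words = [w for w in tokens if w[:1].isupper()]      # Word-Capital feature
    adj_noun = [(w1, w2) for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                if t1.startswith("JJ") and t2.startswith("NN")]  # adjective-noun pattern

    # Entity-type features (organizations, locations, persons) via NLTK's NE chunker.
    entities = []
    for chunk in nltk.ne_chunk(tagged):
        if hasattr(chunk, "label"):                              # a named-entity subtree
            entities.append((chunk.label(), " ".join(w for w, t in chunk.leaves())))
    print(nouns, capital_words, adj_noun, entities)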
  • 43. 2.3.3 Type 3 Features Rhetorical features: These are semantic signals which capture the reader's attention in business event sentences; the following eleven features are identified in the literature as described in (Luismaurijo et al., 2012). Continuation: There are more ideas to come, e.g. moreover, furthermore, in addition, another. Change of direction: There is a change of topic, e.g. in spite of, nevertheless, the opposite, on the contrary. Sequence: There is an order in the presentation of ideas, e.g. in first place, next, into. Illustration: Gives an example, e.g. to illustrate, in the same way as, for instance, for example. Emphasis: Increases the relevance of an idea; these are the most important signals, e.g. it all boils down to, the most substantial issue, should be noted, the crux of the matter, more than anything else. Cause, condition or result: There is a condition or modification attached to the following idea, e.g. if, because, resulting from. Spatial signals: Denote locations, e.g. in front of, between, adjacent, west, east, north, south, beyond. Comparison or contrast: Comparison of two ideas, e.g. analogous to, better, less than, less, like, either. Conclusion: Ends the introduction of the idea and may have special importance, e.g. in summary, from this we see, last of all, hence, finally.
  • 44. Fuzz: There is an idea that is not clear, e.g. looks like, seems like, alleged, maybe, probably, sort of. 2.4 Description of Vectorizers All the features extracted from a given sentence have to be converted into vectors using vectorizers such as the count vectorizer and the TF-IDF vectorizer. The method used to convert words to vectors is the bag-of-words approach; the two vectorizers based on this approach are described below. 2.4.1 Count Vectorizer This module uses the counts of the words present within a sentence and converts the sentence into a vector by building a dictionary for the word-to-vector conversion (Harris, 1954). An illustrative example of the count vectorizer is described below. 2.4.1.1 Example of Count Vectorizer Consider the following two sentences. a) John likes to watch movies. Mary likes movies too. b) John also likes to watch football games. Based on the above two sentences the dictionary is constructed as follows: { John:1, likes:2, to:3, watch:4, movies:5, also:6, football:7, games:8, Mary:9, too:10 } The dictionary constructed has 10 distinct words. Using the indices of the dictionary, each sentence is represented by a 10-entry vector: sentence1: [1, 2, 1, 1, 2, 0, 0, 0, 1, 1] sentence2: [1, 1, 1, 1, 0, 1, 1, 1, 0, 0] where each entry of the vector gives the count of the corresponding entry in the dictionary (this is also the histogram representation). For example, in the first vector (which represents sentence 1), the first two entries are [1, 2]. The first entry corresponds to the word John, which is the first word in the dictionary, and its value is 1 because John appears in the first sentence 1 time. Similarly, the second
  • 45. entry corresponds to the word likes, which is the second word in the dictionary, and its value is 2 because likes appears in the first sentence 2 times. This vector representation does not preserve the order of the words in the original sentences. 2.4.2 Term Frequency and Inverse Document Frequency Term frequency and inverse document frequency describe the importance of a particular word in a document or sentence within a collection of documents (Manning et al., 2008). Term frequency (TF) is defined as the number of occurrences of a particular word within the document. Inverse document frequency (IDF) is derived from the number of documents containing the particular word. For the analysis in our work using tf-idf with the bag-of-words approach, we treat each sentence as a document. Tf-idf, short for term frequency-inverse document frequency, is a numerical statistic that is intended to reflect how important a particular word is to a sentence within a collection of sentences. 2.4.2.1 Formulation of Term Frequency and Inverse Document Frequency Term frequency formulation: The term frequency tf(t, d) is the number of times that term t occurs in the sentence d. Two formulations of term frequency are described below: a) Boolean frequency: tf(t, d) = 1 if t occurs in d and 0 otherwise. b) Logarithmically scaled frequency: tf(t, d) = 1 + log(count of t in d) if t occurs in d, and 0 otherwise. Inverse document frequency formulation: Inverse document frequency is a measure of how much information a particular word provides in a sentence, in comparison with the collection of sentences under consideration. Inverse document frequency measures whether the term is common or rare across the whole collection of
  • 46. sentences. Mathematically it is described as follows: idf(t, D) = log( N / |{d ∈ D : t ∈ d}| ), where N is the total number of sentences in the collection and |{d ∈ D : t ∈ d}| is the number of sentences d in which the term t appears (i.e., tf(t, d) ≠ 0). If the term is not in the set of sentences, this leads to a division by zero; it is therefore common to adjust the denominator to 1 + |{d ∈ D : t ∈ d}|. 2.4.2.2 Description of the Combination of TF and IDF The tf-idf is then calculated as tf-idf(t, d, D) = tf(t, d) × idf(t, D). A high tf-idf weight is reached by a high term frequency (in the given sentence) and a low document frequency of the term in the whole collection of sentences; the weights hence tend to filter out common terms. 2.4.2.3 Example of TF-IDF Vectorizer Consider the term frequency tables (2.1) and (2.2) for a collection consisting of only two sentences, as listed below.
Table 2.1: The words and their counts in sentence1
  Term     Count
  this     1
  is       1
  a        2
  sample   1
Table 2.2: The words and their counts in sentence2
  Term     Count
  this     1
  is       1
  another  2
  example  3
  • 47. The calculation of tf-idf for the term this in sentence1 is performed as follows. Term frequency, in its basic form, is just the frequency that we look up in the appropriate table; in this case it is 1 for the term this in sentence1. The IDF for the term this is given by idf(this, D) = log( N / |{d ∈ D : t ∈ d}| ). The numerator N is the number of sentences, which is two. The number of sentences in which this appears is also two, giving idf(this, D) = log(2/2) = 0. So the tf-idf value is zero for this term, and with the basic definition this is true of any term that occurs in all sentences. Now consider the term example from sentence2, which occurs three times but in only one sentence, namely sentence2. For this term, tf(example, sentence2) = 3, idf(example, D) = log(2/1) ≈ 0.3010, and tf-idf(example, sentence2) = tf(example, sentence2) × idf(example, D) = 3 × 0.3010 ≈ 0.9030.
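As a concrete illustration of both vectorizers, the sketch below applies scikit-learn's CountVectorizer and TfidfVectorizer to the two example sentences of Section 2.4.1.1; the use of scikit-learn is an assumption here, and its tf-idf uses a smoothed, normalized idf variant, so the exact weights differ slightly from the hand-computed values above.

    # Sketch: bag-of-words count vectors and tf-idf vectors for two sentences
    # (Sections 2.4.1 and 2.4.2), assuming scikit-learn is available.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    sentences = ["John likes to watch movies. Mary likes movies too.",
                 "John also likes to watch football games."]

    count_vec = CountVectorizer()
    counts = count_vec.fit_transform(sentences)   # sparse matrix of word counts
    print(count_vec.vocabulary_)                  # the learned dictionary (word -> index)
    print(counts.toarray())                       # one count vector per sentence

    tfidf_vec = TfidfVectorizer()
    weights = tfidf_vec.fit_transform(sentences)  # tf-idf weighted vectors
    print(weights.toarray())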
  • 48. Chapter 3 Machine Learning Algorithms Used for the Analysis of Business Event Recognition This chapter discusses the set of machine learning algorithms which were implemented as part of our work. The semi-supervised approach with naive Bayes expectation-maximization and active learning with QBC are used to increase the amount of labeled data. The gradient boosting classifier, AdaBoost classifier, random forest classifier, multilayer feed forward network and convolutional neural network are used to classify the business event data. The following sections give a detailed description of these algorithms. 3.1 Semi-supervised Learning using Naive Bayes Classifier with Expectation-Maximization Algorithm In this approach a naive Bayes classifier is first built in the standard supervised fashion from the limited amount of labeled training data, and we perform classification of the unlabeled data with the naive Bayes model, noting the probabilities associated with each class. Then we rebuild a new naive Bayes classifier using all
  • 49. the labeled and unlabeled data, using the estimated class probabilities as true class labels. We iterate this process of classifying the unlabeled data and rebuilding the naive Bayes model until it converges to a stable classifier, and the corresponding set of labels for the unlabeled data is obtained. The algorithm is summarized below as in (KamalNigam et al., 2006).
1. Inputs: collections Xl of labeled sentences and Xu of unlabeled sentences.
2. Build an initial naive Bayes classifier K* from the labeled sentences Xl only.
3. Loop while the classifier parameters improve, as measured by the change in l(K | X, Y), the log probability of the labeled and unlabeled data and the prior:
(a) (E-step) Use the current classifier K* to estimate the component membership of each unlabeled sentence, i.e. the probability that each mixture component (and class) generated each sentence, P(Y = cj | X = xi; K*), where X and Y are random variables, cj is the output of the j-th class and xi is the i-th input data point.
(b) (M-step) Re-estimate the classifier K*, given the estimated component membership of each sentence, using maximum a posteriori parameter estimation: K* = arg max_K P(X, Y | K) P(K).
4. Output: the classifier K*, which takes an unlabeled sentence and predicts a class label.
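The sketch below gives a simplified version of this procedure, assuming scikit-learn's multinomial naive Bayes over bag-of-words counts. For brevity it iterates with hard predicted labels (a self-training style approximation of the E- and M-steps above) rather than full soft posterior weighting, and the example sentences and labels are placeholders.

    # Sketch: simplified semi-supervised naive Bayes in the spirit of Section 3.1.
    import numpy as np
    from scipy.sparse import vstack
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    labeled_texts = ["IBM today announced a definitive agreement to acquire Silverpop",
                     "Carlyle invests across four segments"]          # placeholder data
    labels = np.array([1, 0])                                         # 1 = business event
    unlabeled_texts = ["Company X announced an agreement to acquire Company Y",
                       "The weather in Atlanta was pleasant"]         # placeholder data

    vec = CountVectorizer()
    X_l = vec.fit_transform(labeled_texts)
    X_u = vec.transform(unlabeled_texts)

    clf = MultinomialNB().fit(X_l, labels)            # initial classifier K* (step 2)
    prev = None
    for _ in range(20):                               # loop until the labels stabilise (step 3)
        y_u = clf.predict(X_u)                        # E-step (hard-label approximation)
        if prev is not None and np.array_equal(y_u, prev):
            break
        prev = y_u
        clf = MultinomialNB().fit(vstack([X_l, X_u]),             # M-step: re-estimate K*
                                  np.concatenate([labels, y_u]))
    pseudo_labels = prev                              # labels assigned to the unlabeled pool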
  • 50. 3.2 Active Learning using Ensemble Classifiers with the QBC Approach The ensemble classifiers used for the QBC approach are the gradient boosting classifier, AdaBoost classifier and random forest classifier. This approach is described briefly below. 3.2.1 Query by Committee In this approach an ensemble of hypotheses is learned, and the examples that cause maximum disagreement among this committee (with respect to the predicted categorization) are selected as the most informative examples from a pool of unlabeled examples. QBC iteratively selects examples to be labeled for training; in each iteration a committee of classifiers based on the current training set predicts labels. It then evaluates the potential utility of each example in the unlabeled set and selects a subset of examples with the highest expected utility. The labels for these examples are acquired and they are transferred to the training set. Typically, the utility of an example is determined by some measure of disagreement in the committee about its predicted label. This process is repeated until the number of available requests for labels is exhausted. 3.3 Ensemble Models for Classification of Business Events using the Bag-of-Words Approach The series of classifiers trained on the dataset included the SVM, decision tree, random forest, AdaBoost, gradient boosting and SGD classifiers. Among these, the boosting classifiers and the random forest classifier performed better than the other classifiers. We used three ensemble classifiers with the decision tree as the base learner, namely the gradient boosting classifier, the AdaBoost classifier and the random forest classifier. In the end, classification of the business event datasets was done by majority voting of these classifiers (a code sketch of this combination is given later in this section). The description and mathematical formulation of each ensemble classifier is given below. 3.3.1 Gradient Boosting Classifier Boosting algorithms are a family of machine learning algorithms which build a strong classifier from a set of weak classifiers, typically decision trees. Gradient boosting is one such algorithm; it builds the model in a stage-wise fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function.
  • 51. The differentiable loss function in our case is the binomial deviance loss. The algorithm is implemented as described in (Friedman et al., 2001).
Input: a training set (Xi, yi), i = 1, ..., n, with Xi ∈ H ⊆ R^n and yi ∈ {−1, 1}; a differentiable loss function L(y, F(X)), in our case the binomial deviance loss log(1 + exp(−2yF(X))); and M, the number of iterations.
1. Initialize the model with a constant value: F0(X) = arg min_γ Σ_{i=1}^{n} L(yi, γ).
2. For m = 1 to M:
(a) Compute the pseudo-responses: rim = −[∂L(yi, F(Xi)) / ∂F(Xi)] evaluated at F(X) = F_{m−1}(X), for i = 1, ..., n.
(b) Fit a base learner hm(X) to the pseudo-responses, i.e. train it on the set {(Xi, rim)}, i = 1, ..., n.
(c) Compute the multiplier γm by solving the optimization problem γm = arg min_γ Σ_{i=1}^{n} L(yi, F_{m−1}(Xi) + γ hm(Xi)).
(d) Update the model: Fm(X) = F_{m−1}(X) + γm hm(X).
3. Output F_M(X) = Σ_{m=1}^{M} γm hm(X).
The value of the weight γm is found by an approximate Newton-Raphson step, given as γm = Σ_{Xi ∈ hm} rim / Σ_{Xi ∈ hm} |rim| (2 − |rim|).
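A minimal sketch of training such a model is shown below, assuming scikit-learn's GradientBoostingClassifier, whose default loss is the binomial deviance with regression trees as base learners; the synthetic data stands in for the vectorized business-event sentences of Chapter 2.

    # Sketch: gradient boosting over (stand-in) bag-of-words features (Section 3.3.1).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=200, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

    gb = GradientBoostingClassifier(n_estimators=100,   # M, the number of boosting stages
                                    learning_rate=0.1,  # shrinkage applied to each stage
                                    max_depth=3)        # depth of each base tree h_m(X)
    gb.fit(X_train, y_train)                            # default loss: binomial deviance
    print("accuracy:", gb.score(X_test, y_test))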
  • 52. 29 voting by all the classifiers. The description of the algorithm as in (Freund et al., 1995) is given below: Input n examples: (X1, y1), ..., (Xn, yn), Xi ∈ H ⊆ Rn , yi ∈ [−1, 1] 1. Initialize: wi(1) = 1 n , ∀i, each data point is initialized with equal weight, so when data points are sampled from the probability distribution the chance of getting the data point in the training set is equally likely. 2. We assume that there as M classifiers within the Ensembles. For m=1 to M do (a) Generate a training set by sampling with wi(m). (b) Learn classifier hm using this training set. (c) let ξm = n i=1 wi(m) I[yi=hm(Xi)] where IA is the indicator function of A and is defined as IA = 1 if [yi = hm(Xi)] IA = 0 if [yi = hm(Xi)] so ξm is the error computed due to the mth classifier. (d) Set αm=log(1−ξm ξm ) computed hypothesis weight, such that αm > 0 be- cause of the assumption that ξ < 0.5. (e) Update the weight distribution over the training set as wi(m + 1)= wi(m) exp(αmI[yi=hm(Xi)]) Normalization of the updated weights so that wi(m+1) is a distribution. wi(m + 1) = wi(m+1) i wi(m+1) end for 3. Output is final vote h(X) = sgn( M m=1 αmhm(x)) is the weighted sum of all classifiers in the ensemble. In the adaboost algorithm M is a parameter. Due to the sampling with weights, we can continue the procedure for arbitrary number of iterations. Loss function used in adaboost algorithm is exponential loss function and for a particular data point its defined as exp(−yif(Xi))
  • 53. 3.3.3 Random Forest Classifier Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest. The main difference between standard decision trees and random forests is that in a decision tree each node is split using the best split among all variables, while in a random forest each node is split using the best among a subset of predictors randomly chosen at that node. In the random forest classifier, ntree bootstrap samples are drawn from the original data, and for each bootstrap sample an unpruned classification tree is grown, with the following modification: at each node, rather than choosing the best split among all predictors, randomly sample mtry of the predictors and choose the best split among those variables. New data are predicted by aggregating the predictions of the ntree trees (i.e., majority vote for classification). The algorithm is described as in (Breiman, 2001):
Input: n examples (X1, y1), ..., (Xn, yn) = D, Xi ∈ R^n, where D is the whole dataset.
For i = 1, ..., B:
1. Choose a bootstrap sample Di from D.
2. Construct a decision tree Ti from the bootstrap sample Di such that at each node a random subset of m features is chosen, and splitting is considered only on those features.
Finally, given test data Xt, take the majority vote of the trees for classification. Here B is the number of bootstrap data sets generated from the original data set D. 3.4 Multilayer Feed Forward Network with Back Propagation using the Word Embedding Approach In this approach a word embedding framework was used to convert words to vectors, followed by applying an MFN to classify the business event dataset.
  • 54. The Gensim module in Python was used to build this word embedding, training the words with the CBOW (continuous bag-of-words) or skip-gram model of the unsupervised neural language model (Tomas Mikolov et al., 2013), where each word is assigned a uniformly distributed (U[−1, 1]) 100- to 300-dimensional vector. Once vectors have been initialized for each word using the word embedding, a window-based approach is used to convert the word vectors into a single global sentence vector. The obtained global sentence vector is fed into an MFN with back-propagation for classification of the sentences using a soft-max classifier. The algorithm is implemented as follows: 1. Initialize each word in a sentence with a uniformly distributed (U[−1, 1]) dense vector of 100 to 300 dimensions. 2. For a given set of words within a sentence, concatenate the word-embedding vectors to form a matrix for that particular sentence. 3. Choose an appropriate window size on the obtained matrix and apply max-pooling based on that window size to obtain a global sentence vector. 4. Feed the obtained global sentence vectors into a multilayer feed forward network with back propagation, using soft-max as the loss function. For regularization of the multilayer feed forward network and to avoid overfitting of the data, the dropout mechanism is adopted. 3.5 Convolutional Neural Networks for Sentence Classification with Unsupervised Feature Vector Learning In this model a simple CNN is trained with one layer of convolution on top of word vectors obtained from an unsupervised neural language model (Yoon Kim, 2014). These vectors were trained by (Mikolov et al., 2013) on 100 billion words
  • 55. of Google News and are publicly available. Figure (3.1) describes the architecture of the CNN for sentence modeling. Figure 3.1: Architecture of the convolutional neural network for sentence modelling (multichannel architecture). Let N be the number of sentences in the vocabulary and n the number of words in a particular sentence, and let xi ∈ R^k be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as x1:n = x1 ⊕ x2 ⊕ ... ⊕ xn, where ⊕ is the concatenation operator. In general, let xi:i+j refer to the concatenation of words xi, xi+1, ..., xi+j. The weight vector w is initialized with
  • 56. a random uniformly distributed matrix of size R^{h×k}. A convolution operation involves a filter weight matrix w which is applied to a window of h words of a particular sentence to produce a new feature. For example, a feature ci is generated from a window of words xi:i+h−1 by ci = f(w · xi:i+h−1 + b), where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence, [x1:h, x2:h+1, ..., xn−h+1:n], to produce a feature map c = [c1, c2, ..., cn−h+1], with c ∈ R^{n−h+1}. We then apply a max-pooling operation over the feature map and take the maximum value c* = max[c] as the feature corresponding to this particular filter. The idea is to capture the most important feature, the one with the highest value, for each feature map. This pooling scheme naturally deals with variable sentence lengths. We have described the process by which one feature is extracted from one filter; the model uses multiple filters (with varying window sizes) to obtain multiple features. These features are also called unsupervised features, because they are obtained by applying different filters with variable window sizes. These features form the penultimate layer and are passed to a fully connected soft-max layer whose output is the probability distribution over labels. To avoid overfitting of the CNN models, the drop-out mechanism is adopted. 3.5.1 Variations in CNN Sentence Models CNN-rand: our baseline model, where all words are randomly initialized and then modified during training. CNN-static: a model with pre-trained vectors from word2vec. All words, including the unknown ones that are randomly initialized, are kept static and only the other parameters of the model are learned. Initializing word vectors with those
  • 57. obtained from an unsupervised neural language model is a popular method to improve performance in the absence of a large supervised training set. We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have a dimensionality of 300 and were trained using the continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in the set of pre-trained words are initialized randomly.
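The sketch below outlines the single-convolution-layer architecture of Section 3.5 using Keras; the choice of Keras is an assumption on our part, since the report does not name its implementation framework. It corresponds to the CNN-rand baseline with one filter width; the full model would use several parallel filter widths, and loading pre-trained word2vec vectors into the embedding layer would give the CNN-static variant.

    # Sketch: one-layer CNN for sentence classification in the spirit of Section 3.5
    # (CNN-rand variant: randomly initialized embeddings trained with the model).
    import numpy as np
    from tensorflow.keras import layers, models

    vocab_size, emb_dim, max_len = 5000, 300, 50          # illustrative sizes

    model = models.Sequential([
        layers.Embedding(vocab_size, emb_dim),             # word vectors x_i in R^k
        layers.Conv1D(filters=100, kernel_size=3,
                      activation="tanh"),                  # windows of h = 3 words
        layers.GlobalMaxPooling1D(),                       # max-over-time pooling, c* = max[c]
        layers.Dropout(0.5),                               # drop-out regularization
        layers.Dense(2, activation="softmax")              # soft-max layer over the labels
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Toy data standing in for padded, integer-encoded business-event sentences.
    X = np.random.randint(0, vocab_size, size=(32, max_len))
    y = np.random.randint(0, 2, size=(32,))
    model.fit(X, y, epochs=1, verbose=0)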
  • 58. Chapter 4 Results and Discussions In this chapter we discuss the results obtained from the machine learning algorithms applied in our work: 1. the semi-supervised learning approach using naive Bayes with expectation-maximization, and active learning with QBC, to increase the number of labeled data points; 2. the ensemble classifiers and the MFN and CNN models to classify the obtained business data. Described below are the results and an analysis of the algorithms. 4.1 Semi-supervised Learning Implementation using Naive Bayes with Expectation Maximization Initially we had a few data points which were labeled in a supervised manner. To formulate and solve the problem as a business event classification problem, our primary objective was to increase the number of labeled data points. In accordance with the algorithm of semi-supervised learning using the naive Bayes classifier with expectation maximization explained in Section 3.1, the following are the results in the three domains of acquisition, vendor-supplier and job events, with the training data taken as 30%, 40% and 50% of the whole dataset and the rest of the pool as test data.
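The accuracies, F-scores, confusion matrices and ROC curves reported in the tables and figures of this chapter can be computed, for example, with scikit-learn's metrics module; the sketch below illustrates this under that assumption, with a synthetic dataset and a random forest standing in for the actual business-event data and classifiers.

    # Sketch: computing accuracy, F-score, confusion matrix and ROC-curve area
    # for a 30% train / 70% test split, assuming scikit-learn.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix,
                                 roc_curve, auc)
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, n_features=50, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.3,
                                                        random_state=0)  # 30% training data
    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

    y_pred = clf.predict(X_test)
    print("accuracy:", accuracy_score(y_test, y_pred))
    print("F-score :", f1_score(y_test, y_pred))
    print("confusion matrix:\n", confusion_matrix(y_test, y_pred))

    scores = clf.predict_proba(X_test)[:, 1]          # score for the positive class
    fpr, tpr, _ = roc_curve(y_test, scores)
    print("area under ROC curve:", auc(fpr, tpr))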
  • 59. 4.1.1 Results and Analysis of Vendor-Supplier Event Data The vendor-supplier data points labeled in a supervised manner numbered 754. Stated below are some of the observations made for a large pool of unlabeled test data, obtained by varying the proportions of test and training data. Table (4.1) and figure (4.1) show the variation of accuracies and F-scores for 30%, 40% and 50% of the data used for training and the corresponding remaining part as test data. Figures (4.2), (4.4) and (4.6) display the confusion matrices for these variations; the confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives. Figures (4.3), (4.5) and (4.7) display the corresponding ROC curves. Analysis: We observe an increase in accuracy and F-score as the number of training data points increases, which is as expected. However, the accuracies increase by larger amounts than the F-scores, because true negatives are more numerous than true positives. The confusion matrix plots show slight variations in the numbers of true positives and true negatives as the number of training data points is increased. The ROC curves show an increase in the TPR and in the area under the ROC curve as the number of training data points increases.
  • 60. Table 4.1: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for vendor-supplier data
  Training data (%)   Accuracy   F-score   Description of dataset
  30                  0.5597     0.5915    Testing data = 527, training data = 227
  40                  0.7434     0.65      Testing data = 454, training data = 300
  50                  0.7765     0.674     Testing data = 376, training data = 376
Figure 4.1: Variations in accuracies and F1-scores for vendor-supplier data using the naive Bayes semi-supervised technique
  • 61. Figure 4.2: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for VNSP. Figure 4.3: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for VNSP.
  • 62. Figure 4.4: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for VNSP. Figure 4.5: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for VNSP.
  • 63. Figure 4.6: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for VNSP. Figure 4.7: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for VNSP.
  • 64. 4.1.2 Results and Analysis for Job Event Data The job event data points labeled in a supervised manner numbered 2810. Stated below are some of the observations made for a large pool of unlabeled test data, obtained by varying the proportions of test and training data. Table (4.2) and figure (4.8) show the variation of accuracies and F-scores for 30%, 40% and 50% of the data used for training and the corresponding remaining part as test data. Figures (4.9), (4.11) and (4.13) display the confusion matrices for these variations; the confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives. Figures (4.10), (4.12) and (4.14) display the corresponding ROC curves. Analysis: As the number of training data points increases, we observe an increase in accuracy and F-score. However, there is a large difference between the accuracies and the F-scores, because the true negatives greatly outnumber the true positives, which are few in number; this is clearly visible in the confusion matrix plots. The ROC curves show an increase in the TPR and in the area under the ROC curve as the number of training data points increases.
  • 65. Table 4.2: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for job event data
  Training data (%)   Accuracy   F-score   Description of dataset
  30                  0.7483     0.4444    Testing data = 1967, training data = 842
  40                  0.7544     0.4863    Testing data = 1686, training data = 1123
  50                  0.8014     0.52      Testing data = 1405, training data = 1404
Figure 4.8: Variations in accuracies and F1-scores for job event data using the naive Bayes semi-supervised technique
  • 66. Figure 4.9: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for JOB. Figure 4.10: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for JOB.
  • 67. Figure 4.11: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for JOB. Figure 4.12: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for JOB.
  • 68. Figure 4.13: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for JOB. Figure 4.14: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for JOB.
  • 69. 4.1.3 Results and Analysis for Acquisition Event Data The acquisition event data points labeled in a supervised manner numbered 1380. Stated below are some of the observations made for a large pool of unlabeled test data, obtained by varying the proportions of test and training data. Table (4.3) and figure (4.15) show the variation of accuracies and F-scores for 30%, 40% and 50% of the data used for training and the corresponding remaining part as test data. Figures (4.16), (4.18) and (4.20) display the confusion matrices for these variations; the confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives. Figures (4.17), (4.19) and (4.21) display the corresponding ROC curves. Analysis: There is an increase in the accuracy and F-score as the number of training data points increases. The increase in F-scores is slightly higher than the increase in accuracies, because true positives outnumber true negatives; as a result the classifier is more biased towards the positive class than the negative class, and the number of false positives is higher in this scenario, which is clearly visible from the confusion matrix plots. The ROC curves show an increase in the TPR and in the area under the ROC curve as the number of training data points increases.
  • 70. Table 4.3: Variation in accuracies and F-scores in semi-supervised learning using naive Bayes for acquisition event data
  Training data (%)   Accuracy   F-score   Description of dataset
  30                  0.7929     0.8178    Testing data = 966, training data = 413
  40                  0.7989     0.82      Testing data = 828, training data = 521
  50                  0.8057     0.8241    Testing data = 689, training data = 690
Figure 4.15: Variations in accuracies and F1-scores for acquisition event data using the naive Bayes semi-supervised technique
  • 71. Figure 4.16: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for Acquisition. Figure 4.17: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for Acquisition.
  • 72. Figure 4.18: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for Acquisition. Figure 4.19: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for Acquisition.
  • 73. Figure 4.20: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for Acquisition. Figure 4.21: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for Acquisition.
  • 74. 4.2 Active Learning Implementation by the Query by Committee Approach In accordance with the active learning algorithm explained in Section 3.2, the following are some of the results in the three domains of acquisition, vendor-supplier and job events, with the training data taken as 30%, 40% and 50% of the whole dataset and the test data predicted using majority voting of the three ensemble classifiers: gradient boosting classifier, AdaBoost classifier and random forest classifier (i.e. the query by committee approach). 4.2.1 Results and Analysis for Vendor-Supplier Event Data The vendor-supplier data points labeled in a supervised manner numbered 754. Following are some of the observations made for a large pool of unlabeled test data, obtained by varying the proportions of test and training data. Table (4.4) and figure (4.22) show the variation of accuracies and F-scores for 30%, 40% and 50% of the data used for training and the corresponding remaining part as test data. Figures (4.23), (4.25) and (4.27) display the confusion matrices for these variations; the confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives. Figures (4.24), (4.26) and (4.28) display the corresponding ROC curves. Analysis: We observe an increase in accuracy and F-score as the number of training data points increases. However, the accuracies increase by larger amounts than the F-scores, because true negatives outnumber true positives. This method performs better than the semi-supervised naive Bayes classifier. The confusion matrix plots show slight variations in the numbers of true positives and true negatives as the number of training data points is increased. The ROC curves show an increase in the TPR and in the area under the ROC curve as the number of training data points increases.
  • 75. Table 4.4: Variation in accuracies and F-scores using active learning (QBC approach) for vendor-supplier event data
  Training data (%)   Accuracy   F-score   Description of dataset
  30                  0.842      0.7348    Testing data = 529, training data = 225
  40                  0.84       0.7352    Testing data = 454, training data = 300
  50                  0.8643     0.76      Testing data = 376, training data = 376
Figure 4.22: Variations in accuracies and F1-scores for vendor-supplier data using active learning
  • 76. Figure 4.23: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier. Figure 4.24: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for Vendor-supplier.
  • 77. Figure 4.25: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier. Figure 4.26: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for Vendor-supplier.
  • 78. Figure 4.27: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier. Figure 4.28: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for Vendor-supplier.
  • 79. 4.2.2 Results and Analysis for Job Event Data The job event data points labeled in a supervised manner numbered 2809. Following are some of the observations made for a large pool of unlabeled test data, obtained by varying the proportions of test and training data. Table (4.5) and figure (4.29) show the variation of accuracies and F-scores for 30%, 40% and 50% of the data used for training and the corresponding remaining part as test data. Figures (4.30), (4.32) and (4.34) display the confusion matrices for these variations; the confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives. Figures (4.31), (4.33) and (4.35) display the corresponding ROC curves. Analysis: As the number of training data points increases, we observe an increase in accuracy and F-score. However, there is a large difference between the accuracies and the F-scores, because the true negatives greatly outnumber the true positives, which are few in number; this is clearly visible in the confusion matrix plots. The ROC curves show an increase in the TPR and in the area under the ROC curve as the number of training data points increases. The performance of this method is better than that of the semi-supervised naive Bayes classifier, as is clearly visible from our results.
  • 80. Table 4.5: Variation in accuracies and F-scores using active learning (QBC approach) for job event data
  Training data (%)   Accuracy   F-score   Description of dataset
  30                  0.9054     0.6204    Testing data = 1967, training data = 842
  40                  0.9116     0.6558    Testing data = 1686, training data = 1123
  50                  0.9216     0.6758    Testing data = 1405, training data = 1404
Figure 4.29: Variations in accuracies and F1-scores for job event data using active learning
  • 81. Figure 4.30: Confusion matrix for a large pool of testing data of 70 percent and training data of 30 percent for Job. Figure 4.31: ROC curve for a large pool of testing data of 70 percent and training data of 30 percent for Job.
  • 82. Figure 4.32: Confusion matrix for a large pool of testing data of 60 percent and training data of 40 percent for Job. Figure 4.33: ROC curve for a large pool of testing data of 60 percent and training data of 40 percent for Job.
  • 83. Figure 4.34: Confusion matrix for a large pool of testing data of 50 percent and training data of 50 percent for Job. Figure 4.35: ROC curve for a large pool of testing data of 50 percent and training data of 50 percent for Job.
  • 84. 4.2.3 Results and Analysis for Acquisition Event Data The acquisition event data points labeled in a supervised manner numbered 1380. Following are some of the observations made for a large pool of unlabeled test data, obtained by varying the proportions of test and training data. Table (4.6) and figure (4.36) show the variation of accuracies and F-scores for 30%, 40% and 50% of the data used for training and the corresponding remaining part as test data. Figures (4.37), (4.39) and (4.41) display the confusion matrices for these variations; the confusion matrix gives insight into the numbers of true positives, true negatives, false positives and false negatives. Figures (4.38), (4.40) and (4.42) display the corresponding ROC curves. Analysis: There is an increase in the accuracy and F-score as the number of training data points increases. The increase in F-scores is comparable to the increase in accuracies, and the confusion matrix plots show that the numbers of true positives and true negatives are nearly equal. The ROC curves show an increase in the TPR and in the area under the ROC curve as the number of training data points increases. This method shows a slight improvement in accuracies compared to the semi-supervised naive Bayes classifier.