mohan-sc13m055

Business Event Recognition
From Online News Articles
Machine Learning Graduate Program
Mohan Kashyap.P
SC13M055
Supervisor: Dr. Sumitra.S
Department of Mathematics
IIST
Mentor: Mahesh CR
CEO
Tataatsu Idealabs
May 18, 2015

Acknowledgement
I thank Tataatsu Idealabs for allowing me to carry out my thesis
work.

Tataatsu Idealabs
• An organization that works on two main products,
Collablayer and Disquery.
• Disquery is an NLP analytics engine that extracts semantic
signals and identifies patterns from unstructured text. Quicker
insight into data helps to make better decisions.
• Business event recognition falls under the category of
Disquery.

Illustrative Example

Outline
1 Basic Overview
Introduction
2 Data Extraction, Data Pre-processing and Feature Engineering
Data Extraction
Text To Numeric Conversion
3 Algorithms and Results
Semi-Supervised Techniques
Machine Learning Approach
Unsupervised Feature Vector Learning Approach
4 Conclusion
Challenges Encountered
Future Work
References
Appendix

Introduction
• The project work deals with the identification of business
events.
• The process starts with crawling the data.
• The extracted data is then labeled.
• Data pre-processing and feature engineering techniques are
applied next.
• Finally, the result is evaluated with machine learning approaches.

Objective
• Given an online article or content of interest from the end
user,
• the developed automated model must predict whether the
given document contains a business event or not.
• Business events in our scenario are restricted to merger and
acquisition, vendor-supplier and job events.
• If the model predicts a business event, it has to give out
additional information.
• The additional information is the 'entities', i.e. the organizations and
persons involved.

Motivation
• Major business events happen every day around the globe.
• An organization, as a competitor, will be interested in
understanding the analytics of another organization
• to develop better business strategies, and
• to enhance decision making, which leads to development and
growth of the organization.

Related Works
• The paper closest to our work is Recognition of Named-Event
Passages in News Articles[1].
• This paper describes a method for finding named
events
• in the violent behaviour and business domains.
• In the business domain it covers:
• management changes, mergers and acquisitions, strikes, legal
troubles and bankruptcy.

Data Extraction 9
Extraction And Labeling Of The Data
• Crawlers were written to extract the business event data.
• The gathered data was split into sentences using an NLP
sentence tokenizer.
• The three classes labeled were acquisition, vendor-supplier
and job.

Data Description
Vendor-Supplier Event Data
Title : Tri-State signs agreement with NextEra Energy Resources for new wind
facility in eastern Colorado; WESTMINSTER, Colo., Feb. 5, 2014
/PRNewswire/ – Tri-State Generation and Transmission Association, Inc.
announced that it has entered into a 25-year agreement with a subsidiary of
NextEra Energy Resources, LLC for a 150 megawatt wind power generating
facility to be constructed in eastern Colorado,in the service territory of
Tri-State member cooperative K. C. Electric Association (Hugo, Colo.)
Acquisition Event Data
Sun Pharmaceutical Industries announced on Monday that it would acquire
troubled rival Ranbaxy Laboratories in a USD 4-billion deal that includes USD
800 million debt.
Job Event Data
Bank of America Merrill Lynch has hired Tristan Cheesman as head of
European ABS syndicate according to a source.

Data Pre-processing And Feature Engineering
• Cleansing of the tagged sentences by removing stopwords
and special symbols.
• Building hand-crafted features by observing patterns in the
data.
• Type1 features - capture the semantics and patterns in the
data[2].
• Type2 features - entity type features.
• Type3 features - rhetorical features.

Type1 Features
• Nouns and Noun phrases
example: Title agreement Next Era Energy wind facility
eastern Colorado WESTMINSTER Colo. Feb. Generation
Transmission Association Inc.
• Capital words
example: WESTMINSTER LLC K. C.
• Pattern of POS tags adjective-noun, adjective-adjective-noun
format
example: new wind 25-year agreement Tri-State member

Type2 Features
• Organization names
example : K. C. Electric Association NextEra Energy Resource
• Organization references
example : k. c. electric association nextera energy resources
• Location
example : WESTMINSTER Colo. Colorado
• Person
example : Jack stone

Text To Numeric Conversion 14
Bag Of Words Approach
• Data obtained from pre-processing and feature engineering
has to be converted into vectors.
• Bag of words is a method used to convert words to vectors[9].
• The two vectorizers used under this method are count
vectorizers and tf-idf vectorizers.
• Count vectorizers: use the counts of the words to convert them
into vectors.
• TF-IDF vectorizers: convert words into vectors based on the
importance of each word in the sentence.

Illustration Of Count And Tf-Idf Vectorizers
• Count vectorizer illustration: document = [[John likes to
watch movies. Mary likes movies too]; [John also likes to
watch football games.]]
sentence1: [0,0,0,1,2,1,2,1,1,1]
sentence2: [1,1,1,1,1,0,0,1,0,1]
• Tf-idf vectorizer illustration:
$\mathrm{TF}(\text{movies}, \text{sentence1}) = 1 + \log_{10}(2) = 1.3010$
$\mathrm{IDF}(\text{movies}, \text{document}) = \log_{10}\left(\frac{2}{1}\right) = 0.3010$
$\text{TF-IDF} = \mathrm{TF}(\text{movies}, \text{sentence1}) \times \mathrm{IDF}(\text{movies}, \text{document}) = 1.3010 \times 0.3010 = 0.3916$
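
The numbers in this illustration can be reproduced with a short sketch. It assumes lower-cased alphabetic tokens, an alphabetically sorted vocabulary, a base-10 logarithm, and the sub-linear term frequency 1 + log(count); these choices are inferred from the figures on the slide, and the helper names are illustrative only.

```python
import math
import re

docs = [
    "John likes to watch movies. Mary likes movies too",
    "John also likes to watch football games.",
]

def tokenize(text):
    # Lower-case the text and keep alphabetic tokens only (assumed tokenization).
    return re.findall(r"[a-z]+", text.lower())

tokens = [tokenize(d) for d in docs]
# Alphabetically sorted vocabulary: also, football, games, john, likes, ...
vocab = sorted({w for t in tokens for w in t})

def count_vector(doc_tokens):
    # One count per vocabulary word, in vocabulary order.
    return [doc_tokens.count(w) for w in vocab]

def tf(word, doc_tokens):
    # Sub-linear term frequency: 1 + log10(count), or 0 if the word is absent.
    c = doc_tokens.count(word)
    return 1 + math.log10(c) if c > 0 else 0.0

def idf(word):
    # log10(number of sentences / number of sentences containing the word).
    df = sum(1 for t in tokens if word in t)
    return math.log10(len(tokens) / df)

sentence1 = count_vector(tokens[0])
sentence2 = count_vector(tokens[1])
tfidf_movies = tf("movies", tokens[0]) * idf("movies")
```

Here `sentence1` and `sentence2` are the two count vectors, and `tfidf_movies` is the tf-idf weight of "movies" in the first sentence.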

Word-Embedding
• In this method each word is represented by a 100 to 300
dimensional vector[8].
• The word vectors are initialized in one of two ways:
• randomly, with each component drawn from the uniform distribution U[-1,1];
• from pre-trained word vectors.
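
A minimal sketch of the random initialization, assuming a plain Python dictionary as the embedding table; the function name, the seed and the example words are illustrative and not taken from the project code.

```python
import random

def init_embeddings(vocab, dim=100, seed=0):
    # Draw every component of every word vector from U[-1, 1].
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in vocab}

embeddings = init_embeddings(["acquire", "agreement", "hired"], dim=100)
```

The pre-trained alternative would replace these random vectors with vectors loaded from an existing model such as word2vec[8].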

Naive Bayes With Expectation Maximization[3]
• Train naive Bayes classiﬁer using the labeled data.
• Predict probabilistic labels for the unlabeled data.
• Retrain the classifier using these probabilistic labels.
• Repeat this process until convergence.
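
The loop above can be sketched in pure Python. This is an illustrative two-class multinomial naive Bayes with Laplace smoothing, not the project's implementation; documents are assumed to be lists of tokens, and convergence is assumed to mean that the probabilistic labels stop changing by more than a tolerance.

```python
import math

def train_nb(docs, resp, vocab, n_classes=2):
    # M-step: fit naive Bayes from (possibly soft) class memberships.
    prior = [1.0] * n_classes                                   # smoothed class counts
    counts = [{w: 1.0 for w in vocab} for _ in range(n_classes)]  # smoothed word counts
    for doc, r in zip(docs, resp):
        for c in range(n_classes):
            prior[c] += r[c]
            for w in doc:
                counts[c][w] += r[c]
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_lik = []
    for c in range(n_classes):
        class_total = sum(counts[c].values())
        log_lik.append({w: math.log(v / class_total) for w, v in counts[c].items()})
    return log_prior, log_lik

def posterior(doc, model):
    # E-step for one document: P(class | document).
    log_prior, log_lik = model
    scores = [lp + sum(ll[w] for w in doc) for lp, ll in zip(log_prior, log_lik)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def nb_em(labeled, unlabeled, max_iter=50, tol=1e-6):
    docs_l = [d for d, _ in labeled]
    resp_l = [[1.0 if c == y else 0.0 for c in range(2)] for _, y in labeled]
    vocab = {w for d in docs_l + unlabeled for w in d}
    model = train_nb(docs_l, resp_l, vocab)          # train on labeled data only
    prev = None
    for _ in range(max_iter):
        resp_u = [posterior(d, model) for d in unlabeled]             # probabilistic labels
        model = train_nb(docs_l + unlabeled, resp_l + resp_u, vocab)  # retrain
        if prev is not None:
            change = max((abs(a - b) for r, p in zip(resp_u, prev)
                          for a, b in zip(r, p)), default=0.0)
            if change < tol:                          # labels stopped changing
                break
        prev = resp_u
    return model

labeled = [(["acquire", "deal"], 1), (["weather", "rain"], 0)]
unlabeled = [["acquire", "merger"], ["rain", "storm"]]
model = nb_em(labeled, unlabeled)
```

In this toy example the word "merger" never appears in the labeled data, yet the unlabeled sentence that pairs it with "acquire" pulls it towards class 1.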

Results For Vendor-Supplier Event Dataset
Table: Variation in accuracies and F-scores in semi-supervised learning
using naive Bayes for vendor-supplier data
Semi-supervised learning using naive Bayes for vendor-supplier dataset
Training data points in percentage   Accuracy   F-scores   Description on dataset
30                                   0.5597     0.5915     Testing data=527, training data=227
data=300
data=376

Results For Job Event Dataset
Table: Variation in accuracies and F-scores in semi-supervised learning
using naive Bayes for Job event data
Semi-supervised learning using naive Bayes for Job dataset
Training data points in percentage   Accuracy   F-scores   Description on data
data=842
data=1123
data=1404

Results For Acquisition Event Data Set
Table: Variation in accuracies and F-scores in semi-supervised learning
using naive Bayes for Acquisition event data
Semi-supervised learning using naive Bayes for Acquisition dataset
Training data points in percentage
data=413
data=521
data=690

Active Learning
• Active learning was implemented using the query-by-committee (QBC)
approach[10].
• The classifiers used in the committee were the AdaBoost classifier,
random forest classifier and gradient boosting classifier.
• This method performed better than the semi-supervised
naive Bayes classifier.
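
A minimal sketch of the query step in query by committee: each committee member labels every unlabeled example, and the example on which the members disagree most is sent for manual labeling. Vote entropy is assumed here as the disagreement measure, since the slides do not name one, and the function names are illustrative.

```python
import math
from collections import Counter

def vote_entropy(votes):
    # Entropy of the label distribution produced by the committee for one example.
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def select_query(committee_votes):
    # committee_votes[i] holds the labels predicted for unlabeled example i
    # by each committee member; return the index with maximal disagreement.
    return max(range(len(committee_votes)),
               key=lambda i: vote_entropy(committee_votes[i]))

# Three committee members (for instance AdaBoost, random forest and gradient
# boosting) voting on three unlabeled sentences.
votes = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
query_index = select_query(votes)
```

The selected sentence is labeled, added to the training set, and the committee is retrained before the next query.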

Results For Vendor-Supplier Event Dataset
Table: Variation in accuracies and F-scores using active learning for
vendor-supplier event data
Active learning using QBC approach
Training data points in percentage
data=225
data=300
data=376

Results For Job Event Dataset
Table: Variation in accuracies and F-scores using active learning for Job
event data
Training data points in percentage
data=842
data=1123
data=1404

Results For Acquisition Event Dataset
Table: Variation in accuracies and F-scores using active learning for
Acquisition event data
Training data points in percentage
data=413
data=521
data=690

Ensemble Classifiers With Bag Of Words For Business
Event Classification
• The classifiers used were the AdaBoost classifier[6], random
forest classifier[7] and gradient boosting classifier[5].
• The final prediction was made by voting among these three
classifiers.
• The base learners were decision trees.
• The number of base learners was 500.
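
The voting step can be sketched as follows. Hard (majority) voting over predicted labels is an assumption, since the slides do not say whether labels or class probabilities were combined; the prediction lists are made-up values for illustration.

```python
from collections import Counter

def majority_vote(predictions):
    # predictions: one list of predicted labels per classifier, all of equal length.
    # For each sentence, return the label chosen by most classifiers.
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]

# Hypothetical outputs of the three ensemble classifiers on four sentences.
ada_boost = [1, 0, 1, 1]
random_forest = [1, 1, 0, 1]
gradient_boost = [0, 0, 0, 1]
final = majority_vote([ada_boost, random_forest, gradient_boost])
```

With three voters and two classes a tie cannot occur, so the result is always well defined.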

Results For Vendor-Supplier Data Set With Parameter As 500
Test score for vendor-supplier using voting of three ensemble classifiers with number of estimators as 500
Area under ROC   Accuracy   F-score   Confusion matrix values
88%              91.97%     85.211%   true positives=196, false positives=16, true negatives=583, false negatives=52
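
As a consistency check, accuracy and F-score can be recomputed from the confusion matrix values reported for the vendor-supplier test set. The sketch uses the standard definitions; the function names are illustrative.

```python
def accuracy(tp, fp, tn, fn):
    # Fraction of all predictions that are correct.
    return (tp + tn) / (tp + fp + tn + fn)

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall for the positive class.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

vendor_accuracy = accuracy(tp=196, fp=16, tn=583, fn=52)
vendor_f1 = f1_score(tp=196, fp=16, fn=52)
```

The recomputed accuracy matches the reported 91.97%, and the recomputed F-score is about 85.2%, in line with the reported value.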

Results For Job Data Set With Parameter As 500
Test score for Job using voting of three ensemble classifiers with number of estimators as 500
Area under ROC   Accuracy   F-score   Confusion matrix values
87.56%           92.3%      83.88%    true positives=149, false positives=16,

Results For Acquisition Data Set With Parameter As 500
Test score for Acquisition using voting of three ensemble classifiers with number of estimators as 500
Area under ROC   Accuracy   F-score   Confusion matrix values
92%              94.21%     91.10%    true positives=245, false positives=8,

Performance Measure Analysis For Vendor-Supplier On The
Whole Dataset
Average accuracy and F1-score
Classifier       Accuracy   F-score
Gradient-Boost   0.9063     0.8277
Ada-boost        0.8968     0.8154
Random forest    0.9057     0.8254

Performance Measure Analysis For Acquisition On The
Whole Dataset
Classifier       Accuracy   F-score
Ada-boost        0.9398     0.9021

Performance Measure Analysis For Job On The
Whole Dataset
Classifier       Accuracy   F-score
Ada-boost        0.9006     0.8088

Multilayer Feed Forward Network With Word Embedding
• Each word was initialized with a U[-1,1] variate of 100 dimensions.
• For each sentence a word-embedding matrix was
built.
• A window approach with max-pooling was applied on this
matrix to convert it into a sentence vector.
• The sentence vector was fed into a multilayer feed-forward network (MFN) for classification.
• The performance of this method was satisfactory.
Table: Variation in test score for MFN with word embedding
Test score for MFN with word embedding on vendor-supplier dataset
Accuracy   F-score   Confusion matrix values
0.65       0.39      true negatives=140, true positives=13, false positives=3, false negatives=69

CNN Used For Sentence Modeling With Word-Embedding
Approach[4]
Figure : The Image describes the architecture for Convolutional Neural
Network with Sentence Modeling.

Experimental Setup For CNN Sentence Modeling
• The shape of the input matrix for vendor-supplier was 2515×300.
• For job event it was 1192×300 and for acquisition 580×300.
• The filter shapes used to extract features were 3×300, 4×300
and 5×300.
• The dimension of the hidden units was 100×2.
• The activation function used was ReLU.

Results For CNN
• For the vendor supplier data overall average accuracy for
CNN-rand was 0.9081 and for CNN-word2vec was 0.9167.
• For the acquisition data overall average accuracy for
CNN-rand was 0.9359 and for CNN-word2vec was 0.9657.
• For the Job event data overall average accuracy for CNN-rand
was 0.8046 and for CNN-word2vec was 0.8108.

Challenges Encountered
• Uncertainty in data extraction.
• Business event datasets were unstructured.
• Bag of words vectorizers fail to capture the exact meaning of
the word.
• Application of active learning methods was time consuming.

Summary
• An automated model for recognizing business events in
respective business domains was developed.
• Tf-idf vectorizers performed better compared to the
count-vectorizers.
• All the three ensemble classiﬁers showed good performance.
• CNN-word2vec models performed better compared to the
CNN-rand models.

Summary
• In the acquisition dataset CNN models perform better
compared to the ensemble classifiers.
• In vendor-supplier dataset CNN models perform slightly better
compared to the ensemble classifiers.
• In job event dataset ensemble classifiers perform better
compared to the CNN models.

Future work
• The problem of co-reference resolution.
• Application of HMM.
• Extending business event classification to more
domains.

References 40
[1] Marujo, Luis, Wang Ling, Anatole Gershman, Jaime Carbonell, João P. Neto, and David Matos.
Recognition of Named-Event Passages in News Articles. In 24th International Conference on Computational
Linguistics, pp. 321-329. 2012.
[2] Marujo, Luis, Anatole Gershman, Jaime Carbonell, Robert Frederking, and João P. Neto. Supervised topical
key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. In
Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC), pp. 156-162.
2012.
[3] Nigam, Kamal, Andrew McCallum, and Tom Mitchell. Semi-supervised text classification using EM.
Semi-Supervised Learning, pp. 33-56. 2006.
[4] Kim, Yoon. Convolutional Neural Networks for Sentence Classification. Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746-1751. 2014.
[5] Friedman, Jerome H. Greedy function approximation: a gradient boosting machine. Annals of Statistics, pp.
1189-1232. 2001.
[6] Freund, Yoav, and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. In Computational Learning Theory, pp. 23-37. Springer Berlin Heidelberg, 1995.
[7] Breiman, Leo. Random forests. Machine Learning 45, no. 1, pp. 5-32. 2001.
[8] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in
vector space. arXiv preprint arXiv:1301.3781 (2013).
[9] Harris, Zellig S. Distributional structure. Word, Vol. 10, 1954, pp. 146-162.
[10] Abe, N., and Mamitsuka, H. Query learning strategies using boosting and bagging. Proceedings of the 15th
International Conference on Machine Learning (ICML-98), pp. 1-10. 1998.

References 41
http://127.0.0.1:5000/ Link

Appendix 42
Ada-boost
In AdaBoost we assign non-negative weights to the points in the data set, normalized so that they form a
distribution. In each iteration, we generate a training set by sampling from the data using the weights, i.e. the data
point $(X_i, y_i)$ is chosen with probability $w_i$, where $w_i$ is the current weight for that data point. We
generate the training set by such repeated independent sampling. After learning the current classifier, we increase
the (relative) weights of the data points that are misclassified by the current classifier. We generate a fresh training set
using the modified weights, and so on. The final classifier is essentially a weighted majority vote of all the
classifiers. The description of the algorithm as in (Freund and Schapire, 1995) is given below:
Input: n examples $(X_1, y_1), \ldots, (X_n, y_n)$, with $X_i \in H \subseteq \mathbb{R}^n$ and $y_i \in \{-1, 1\}$.
1 Initialize: $w_i(1) = \frac{1}{n}$ for all $i$. Each data point is initialized with equal weight, so when data points are sampled
from the probability distribution every data point is equally likely to enter the training set.
2 We assume that there are M classifiers in the ensemble.
For m = 1 to M do
1 Generate a training set by sampling with $w_i(m)$.
2 Learn classifier $h_m$ using this training set.
3 Let $\xi_m = \sum_{i=1}^{n} w_i(m)\, I[y_i \neq h_m(X_i)]$, where $I_A$ is the indicator function of A, defined as
$I_A = 1$ if $y_i \neq h_m(X_i)$
$I_A = 0$ if $y_i = h_m(X_i)$
so $\xi_m$ is the weighted error of the m-th classifier.
4 Set $\alpha_m = \log\left(\frac{1 - \xi_m}{\xi_m}\right)$, the hypothesis weight; $\alpha_m > 0$ because of the assumption
that $\xi_m < 0.5$.
5 Update the weight distribution over the training set as
$w_i(m+1) = w_i(m) \exp\left(\alpha_m I[y_i \neq h_m(X_i)]\right)$
and normalize the updated weights so that $w_i(m+1)$ is a distribution:
$w_i(m+1) \leftarrow \frac{w_i(m+1)}{\sum_{j} w_j(m+1)}$
end for

The output is the final vote $h(X) = \operatorname{sgn}\left(\sum_{m=1}^{M} \alpha_m h_m(X)\right)$, the sign of the weighted
sum of all classifiers in the ensemble.
In the AdaBoost algorithm M is a parameter. Due to the sampling
with weights, we can continue the procedure for an arbitrary number
of iterations. The loss function used in the AdaBoost algorithm is the
exponential loss, which for a particular data point is defined
as $\exp(-y_i f(X_i))$, where $f(X) = \sum_{m=1}^{M} \alpha_m h_m(X)$.
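
The weight update and the final vote can be sketched on a one-dimensional toy problem. Two simplifications are assumed here: the base learner is a decision stump, and the weights enter the weighted error directly instead of driving a resampled training set, which keeps the example deterministic. All names and the toy data are illustrative.

```python
import math

def stump_predict(stump, x):
    # A stump (threshold, sign) predicts sign for x <= threshold, else -sign.
    threshold, sign = stump
    return sign if x <= threshold else -sign

def fit_stump(xs, ys, w):
    # Exhaustive search for the stump with the lowest weighted error.
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best, best_err = None, float("inf")
    for t in thresholds:
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w)
                      if stump_predict((t, sign), x) != y)
            if err < best_err:
                best, best_err = (t, sign), err
    return best, best_err

def adaboost(xs, ys, M):
    n = len(xs)
    w = [1.0 / n] * n                        # w_i(1) = 1/n
    ensemble = []
    for _ in range(M):
        stump, xi = fit_stump(xs, ys, w)     # weighted error xi_m
        xi = min(max(xi, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = math.log((1 - xi) / xi)      # hypothesis weight alpha_m
        # Increase the weights of misclassified points, then normalize.
        w = [wi * math.exp(alpha if stump_predict(stump, x) != y else 0.0)
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
        ensemble.append((alpha, stump))
    return ensemble, w

def predict(ensemble, x):
    # Sign of the alpha-weighted sum of the base classifiers.
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble, weights = adaboost(xs, ys, M=3)
predictions = [predict(ensemble, x) for x in xs]
```

No single stump can separate this data, but the weighted vote of three stumps classifies every training point correctly.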

Random forest classiﬁer
Input: n examples $(X_1, y_1), \ldots, (X_n, y_n) = D$, $X_i \in \mathbb{R}^n$, where D is
the whole dataset.
for i = 1, ..., B:
1 Choose a bootstrap sample $D_i$ from D.
2 Construct a decision tree $T_i$ from the bootstrap sample $D_i$
such that, at each node, a random subset of m features is chosen
and only splits on those features are considered.
Finally, given the test data $X_t$, take the majority vote of the trees for
classification. Here B is the number of bootstrap data sets
generated from the original data set D.
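
A compact sketch of these steps, assuming depth-one trees (a single split each) so that the example stays short; a full random forest would grow deeper trees and redraw the feature subset at every node. The function names, the seed and the toy data are illustrative.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    # Sample len(data) examples with replacement.
    return [rng.choice(data) for _ in data]

def fit_stump(sample, features):
    # Best single split (feature, threshold, left label, right label),
    # considering only the given random subset of features.
    best, best_err = None, None
    for j in features:
        for t in sorted({x[j] for x, _ in sample}):
            left = [y for x, y in sample if x[j] <= t]
            right = [y for x, y in sample if x[j] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            err = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best_err is None or err < best_err:
                best, best_err = (j, t, left_label, right_label), err
    return best

def random_forest(data, B, m, seed=0):
    rng = random.Random(seed)
    n_features = len(data[0][0])
    forest = []
    for _ in range(B):
        sample = bootstrap(data, rng)                    # bootstrap sample D_i
        features = rng.sample(range(n_features), m)      # random subset of m features
        forest.append(fit_stump(sample, features))       # tree T_i
    return forest

def predict(forest, x):
    # Majority vote over the trees.
    votes = [left if x[j] <= t else right for j, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

data = [((0.0, 1.0), -1), ((1.0, 0.0), -1), ((2.0, 1.0), 1), ((3.0, 0.0), 1)]
forest = random_forest(data, B=5, m=1)
label = predict(forest, (2.5, 0.5))
```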

Gradient boosting classifier
Boosting algorithms are a set of machine learning algorithms that build a strong classifier from a set of weak
classifiers, typically decision trees. Gradient boosting is one such algorithm; it builds the model in a stage-wise
fashion and generalizes it by allowing optimization of an arbitrary differentiable loss function. The
differentiable loss function in our case is the binomial deviance loss function. The algorithm is implemented as
described in (Friedman, 2001).
Input: training set $\{(X_i, y_i)\}_{i=1}^{n}$, with $X_i \in H \subseteq \mathbb{R}^n$ and $y_i \in \{-1, 1\}$; a differentiable loss function
$L(y, F(X))$, which in our case is the binomial deviance $L(y, F(X)) = \log(1 + \exp(-2yF(X)))$; and M, the
number of iterations.
1 Initialize the model with a constant value:
$F_0(X) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)$.
2 For m = 1 to M:
1 Compute the pseudo-responses:
$r_{im} = -\left[\frac{\partial L(y_i, F(X_i))}{\partial F(X_i)}\right]_{F(X) = F_{m-1}(X)}$ for $i = 1, \ldots, n$.
2 Fit a base learner $h_m(X)$ to the pseudo-responses, i.e. train it
using the training set $\{(X_i, r_{im})\}_{i=1}^{n}$.
3 Compute the multiplier $\gamma_m$ by solving the optimization problem:
$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{m-1}(X_i) + \gamma h_m(X_i)\right)$.
4 Update the model: $F_m(X) = F_{m-1}(X) + \gamma_m h_m(X)$.
3 Output $F_M(X) = F_0(X) + \sum_{m=1}^{M} \gamma_m h_m(X)$.
The value of the weight $\gamma_m$ is found by an approximate Newton-Raphson solution, given as
$\gamma_m = \frac{\sum_{X_i \in h_m} r_{im}}{\sum_{X_i \in h_m} |r_{im}| (2 - |r_{im}|)}$,
where the sums run over the data points that fall in a terminal node of $h_m$.
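
For the binomial deviance the pseudo-response has the closed form $r = 2y / (1 + \exp(2yF))$, obtained by differentiating the loss. The sketch below computes it together with the Newton-Raphson step; the function names and the sample values are illustrative.

```python
import math

def deviance(y, F):
    # Binomial deviance loss log(1 + exp(-2 y F)) for a label y in {-1, 1}.
    return math.log(1.0 + math.exp(-2.0 * y * F))

def pseudo_response(y, F):
    # Negative gradient of the deviance with respect to F.
    return 2.0 * y / (1.0 + math.exp(2.0 * y * F))

def newton_gamma(residuals):
    # Approximate Newton-Raphson step for the points in one terminal node:
    # sum of residuals divided by sum of |r| (2 - |r|).
    numerator = sum(residuals)
    denominator = sum(abs(r) * (2.0 - abs(r)) for r in residuals)
    return numerator / denominator

# Pseudo-responses at F = 0 for two positive points and one negative point.
residuals = [pseudo_response(y, 0.0) for y in (1, 1, -1)]
gamma = newton_gamma(residuals)
```

At F = 0 every pseudo-response equals the label itself, so the step for this node is 1/3.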

CNN
Let N be the number of sentences in the vocabulary and n the number of words in a particular sentence, and let
$x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n
(padded where necessary) is represented as
$x_{1:n} = x_1 \oplus x_2 \oplus \ldots \oplus x_n$
where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots,
x_{i+j}$. A convolution operation involves a filter weight matrix $w \in \mathbb{R}^{h \times k}$, initialized with uniformly distributed
random values, which is applied to a window of h words of a particular sentence to
produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by
$c_i = f(w \cdot x_{i:i+h-1} + b)$.
Here $b \in \mathbb{R}$ is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to
each possible window of words in the sentence $[x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}]$ to produce a feature map
$c = [c_1, c_2, \ldots, c_{n-h+1}]$
with $c \in \mathbb{R}^{n-h+1}$. We then apply a max-pooling operation over the feature map and take the maximum value
$c^{*} = \max\{c\}$ as the feature corresponding to this particular filter. The idea is to capture the most important
feature, the one with the highest value, for each feature map. This pooling scheme naturally deals with variable sentence
lengths. We have described the process by which one feature is extracted from one filter. The model uses multiple
filters (with varying window sizes) to obtain multiple features. These features are also called unsupervised
features, because they are obtained by applying different, randomly initialized filters with variable window sizes. These
features form the penultimate layer and are passed to a fully connected soft-max layer whose output is the
probability distribution over labels.
To avoid overfitting of the CNN models, a drop-out mechanism is adopted.
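
The extraction of one feature from one filter can be sketched in pure Python. The sentence is assumed to be a list of n word vectors of dimension k and the filter a list of h rows of dimension k; the toy values and function names are illustrative, and the identity function replaces the non-linearity in the usage lines so that the numbers are easy to follow.

```python
import math

def conv_feature_map(sentence, w, b, f=math.tanh):
    # c_i = f(w . x_{i:i+h-1} + b) for every window of h consecutive words.
    h = len(w)
    n = len(sentence)
    feature_map = []
    for i in range(n - h + 1):
        window = sentence[i:i + h]
        s = sum(w[r][d] * window[r][d]
                for r in range(h) for d in range(len(w[r])))
        feature_map.append(f(s + b))
    return feature_map

def max_over_time(feature_map):
    # Max-pooling: keep the largest value of the feature map.
    return max(feature_map)

sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 words, k = 2
w = [[1.0, 0.0], [0.0, 1.0]]                      # filter with h = 2
c = conv_feature_map(sentence, w, b=0.0, f=lambda z: z)
c_star = max_over_time(c)
```

The feature map has n - h + 1 = 2 entries regardless of the filter values, and only `c_star` is passed on to the next layer, which is why sentences of different lengths yield feature vectors of the same size.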
