3. Problem Statement Continued…
A baseline system was provided by the organisers,
with precision 96.06% and F1 measure 42.09,
categorizing named entities into the
following categories:
- company - facility - geo-loc
- movie - musicartist - other
- person - product - sportsteam
- tvshow
Goal: improve the precision and F1 measure
of the baseline system
4. Baseline Code Review
- Train and test data, each containing 500 tweets
- Lexicons for people's first names, English stop words,
  product names, a location database, sports teams, and
  TV programs
- A Python script generates the features in the format
  required by CRFSuite
- CRFSuite builds a model from the training data and
  dumps the model in txt format
- CRFSuite's tag mode is then run on the test data to
  extract named entities
- A Perl script performs the evaluation
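The pipeline above hinges on the tab-separated format CRFSuite's learn mode reads: one `LABEL<TAB>feature...` line per token, with a blank line ending each sequence. A minimal sketch of such a feature generator follows; the feature names, window, and labels are illustrative, not the baseline's actual feature set:

```python
def token_features(tokens, i):
    """Illustrative features for token i; not the baseline's actual set."""
    w = tokens[i]
    feats = [
        "w=" + w.lower(),
        "is_title=" + str(w.istitle()),
        "prefix3=" + w[:3].lower(),
    ]
    if i > 0:
        feats.append("prev=" + tokens[i - 1].lower())
    if i < len(tokens) - 1:
        feats.append("next=" + tokens[i + 1].lower())
    return feats


def to_crfsuite_lines(tokens, labels):
    """One 'LABEL<TAB>feat<TAB>feat...' line per token; a blank line ends the sequence."""
    lines = [lab + "\t" + "\t".join(token_features(tokens, i))
             for i, lab in enumerate(labels)]
    return "\n".join(lines) + "\n\n"
```

Feeding the resulting file to `crfsuite learn` and `crfsuite tag` then reproduces the train/tag steps described above.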
5. CRFSuite (Averaged Perceptron)
CRFSuite uses the Averaged Perceptron algorithm,
which takes the average of the feature weights
over all updates made during training.
It is the fastest of CRFSuite's algorithms in terms
of training speed (compared, e.g., with l2sgd:
Stochastic Gradient Descent (SGD) with L2
regularization).
Even though the algorithm is very simple, it
exhibits high prediction performance. In practice,
training must be stopped by specifying a maximum
number of iterations (120 in our case).
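The weight-averaging idea can be shown on a plain binary perceptron; this is a toy sketch of the averaging trick, not CRFSuite's structured implementation:

```python
from collections import defaultdict


def averaged_perceptron(X, y, epochs=120):
    """Binary averaged perceptron over feature dicts; labels are +1/-1.
    Returns the average of the weight vector over all training steps,
    illustrating the averaging used by CRFSuite's 'ap' trainer."""
    w = defaultdict(float)       # current weights
    total = defaultdict(float)   # running sum of weights after each step
    steps = 0
    for _ in range(epochs):      # fixed iteration cap, as in the poster (120)
        for feats, label in zip(X, y):
            score = sum(w[f] * v for f, v in feats.items())
            if label * score <= 0:            # mistake: perceptron update
                for f, v in feats.items():
                    w[f] += label * v
            for f in w:                       # accumulate for the average
                total[f] += w[f]
            steps += 1
    return {f: total[f] / steps for f in total}
```

Averaging damps the weight oscillations of the plain perceptron, which is why such a simple update rule still predicts well.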
6. Changes Done
1. Code changes: logical bug fixed
   The Python script didn't actually consider contiguous
   words when extracting phrase features using the
   windowing approach. Fixing this boosted the precision
   by 0.3% to 96.57%.
2. Non-code changes:
   - Wikipedia titles were heavily pruned and added as
     lexicons, which boosted the precision to 96.24%,
     i.e. by 0.2%.
   - Open data from government websites, such as world
     university names, geographical data like river
     names, and company names, was also added as
     lexicons.
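The windowing fix in change 1 amounts to matching only contiguous token runs against a lexicon. A minimal sketch, with a hypothetical lexicon and window size:

```python
def lexicon_phrase_flags(tokens, lexicon, max_len=3):
    """Mark every token covered by a contiguous run of up to max_len
    adjacent tokens found in the lexicon. The original bug combined
    non-contiguous words; the fix slides a window over adjacent tokens only."""
    lower = [t.lower() for t in tokens]
    in_phrase = [False] * len(tokens)
    for n in range(max_len, 0, -1):             # prefer longer matches
        for i in range(len(tokens) - n + 1):
            if " ".join(lower[i:i + n]) in lexicon:
                for j in range(i, i + n):
                    in_phrase[j] = True
    return in_phrase
```

Each flag can then be emitted as a token feature in the CRFSuite input, so multi-word lexicon entries like "new york" fire only on adjacent tokens.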