3. Problem Statement Continued…
A baseline system was provided by the organisers,
with precision 96.06% and F1 measure 42.09,
categorizing named entities into the
following categories:
- company - facility - geo-loc
- movie - musicartist - other
- person - product - sportsteam
- tvshow
Goal: improve the precision and F1 measure
of the baseline system
4. Baseline Code Review
- Train and test data, each containing 500 tweets
- Lexicons for people's first names, English stop words,
  product names, a location database, sports teams, and
  TV programs
- A Python script generates the features in the format
  required by CRFSuite
- CRFSuite builds a model from the training data and
  dumps the model in txt format
- CRFSuite's tag mode is then run on the test data to
  extract named entities
- A Perl script performs the evaluation
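The pipeline above hinges on the tab-separated format CRFSuite's learn mode reads: one `LABEL<TAB>feature...` line per token, with a blank line ending each sequence. A minimal sketch of such a feature generator follows; the feature names, window, and labels are illustrative, not the baseline's actual feature set:

```python
def token_features(tokens, i):
    """Illustrative features for token i; not the baseline's actual set."""
    w = tokens[i]
    feats = [
        "w=" + w.lower(),
        "is_title=" + str(w.istitle()),
        "prefix3=" + w[:3].lower(),
    ]
    if i > 0:
        feats.append("prev=" + tokens[i - 1].lower())
    if i < len(tokens) - 1:
        feats.append("next=" + tokens[i + 1].lower())
    return feats


def to_crfsuite_lines(tokens, labels):
    """One 'LABEL<TAB>feat<TAB>feat...' line per token; a blank line ends the sequence."""
    lines = [lab + "\t" + "\t".join(token_features(tokens, i))
             for i, lab in enumerate(labels)]
    return "\n".join(lines) + "\n\n"
```

Feeding the resulting file to `crfsuite learn` and `crfsuite tag` then reproduces the train/tag steps described above.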
5. CRFSuite (Averaged Perceptron)
CRFSuite uses the Averaged Perceptron algorithm,
which takes the average of the feature weights
over all updates made during training.
It is the fastest of CRFSuite's algorithms in terms
of training speed (compared, e.g., with l2sgd:
Stochastic Gradient Descent (SGD) with L2
regularization).
Even though the algorithm is very simple, it
exhibits high prediction performance. In practice,
training must be stopped by specifying a maximum
number of iterations (120 in our case).
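The weight-averaging idea can be shown on a plain binary perceptron; this is a toy sketch of the averaging trick, not CRFSuite's structured implementation:

```python
from collections import defaultdict


def averaged_perceptron(X, y, epochs=120):
    """Binary averaged perceptron over feature dicts; labels are +1/-1.
    Returns the average of the weight vector over all training steps,
    illustrating the averaging used by CRFSuite's 'ap' trainer."""
    w = defaultdict(float)       # current weights
    total = defaultdict(float)   # running sum of weights after each step
    steps = 0
    for _ in range(epochs):      # fixed iteration cap, as in the poster (120)
        for feats, label in zip(X, y):
            score = sum(w[f] * v for f, v in feats.items())
            if label * score <= 0:            # mistake: perceptron update
                for f, v in feats.items():
                    w[f] += label * v
            for f in w:                       # accumulate for the average
                total[f] += w[f]
            steps += 1
    return {f: total[f] / steps for f in total}
```

Averaging damps the weight oscillations of the plain perceptron, which is why such a simple update rule still predicts well.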
6. Changes Done
1. Code changes: logical bug fixed
   The Python script didn't actually consider contiguous
   words when extracting phrase features using the
   windowing approach. Fixing this boosted the precision
   by 0.3% to 96.57%.
2. Non-code changes:
   - Wikipedia titles were heavily pruned and added as
     lexicons, which boosted the precision to 96.24%,
     i.e. by 0.2%.
   - Open data from government websites, such as world
     university names, geographical data like river
     names, and company names, was also added as
     lexicons.
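The windowing fix in change 1 amounts to matching only contiguous token runs against a lexicon. A minimal sketch, with a hypothetical lexicon and window size:

```python
def lexicon_phrase_flags(tokens, lexicon, max_len=3):
    """Mark every token covered by a contiguous run of up to max_len
    adjacent tokens found in the lexicon. The original bug combined
    non-contiguous words; the fix slides a window over adjacent tokens only."""
    lower = [t.lower() for t in tokens]
    in_phrase = [False] * len(tokens)
    for n in range(max_len, 0, -1):             # prefer longer matches
        for i in range(len(tokens) - n + 1):
            if " ".join(lower[i:i + n]) in lexicon:
                for j in range(i, i + n):
                    in_phrase[j] = True
    return in_phrase
```

Each flag can then be emitted as a token feature in the CRFSuite input, so multi-word lexicon entries like "new york" fire only on adjacent tokens.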