CLTL presentation: training an opinion mining system from KAF files using CRF

OM, CRF and KAF
Rubén Izquierdo Beviá
CLTL tutorial
11-July-2013

OM, CRF and KAF
1. OM
Develop an Opinion Miner tool
2. CRF
Using supervised Machine Learning (CRF)
3. KAF
Using KAF files as input

Opinion Miner
(Figure: the example sentence annotated with HOLDER, TARGET and EXPRESSION labels.)
 Detecting and extracting fine grained opinions in text.
 Opinion elements:
 Expression: the actual subjective statement
 Holder: who the opinion comes from
 Target: what the opinion is about
My wife said that the room was really dirty.
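As a minimal sketch, the three opinion elements can be held in a small record. The class name `Opinion`, its field names, and the split of the example sentence into holder, target and expression are illustrative assumptions, not part of the tool presented here.

```python
from dataclasses import dataclass


@dataclass
class Opinion:
    """One fine-grained opinion (hypothetical container, for illustration)."""
    expression: str  # the actual subjective statement
    holder: str      # who the opinion comes from
    target: str      # what the opinion is about


# Assumed reading of the example sentence on this slide.
example = Opinion(expression="really dirty", holder="My wife", target="the room")
print(example)
```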

CRF
 Conditional Random Fields
 Statistical modeling method
 Models a conditional probability distribution over sequences
 Suitable for segmenting and labeling structured data (sequences, trees…)
 Expressions, holders and targets are sequences
 Many different packages:
 Mallet (http://mallet.cs.umass.edu)
 CRFSuite (http://www.chokkan.org/software/crfsuite)
 Most used input format:
 Sequential data
 One token per line, represented by features

KAF
 KAF modified for OpeNER
 Different layers for different information
 All the features are extracted from the KAF file; no external linguistic processors are called

First steps
 Define which will be our “output classes”
 Target, Holder, Positive, Negative
 Define which features will represent each token
 Token, lemma, POS, polarity, entity, and the surrounding bi/tri-grams
 Study the input format of your selected CRF package
(CRFSuite in my case)

CRFSuite input format
 Input format of CRFSuite
 One file with all data
 Sequences separated by empty lines
 One token per line with the format:
 CLASS [TAB] FEATS
 CLASS: O | B-class | I-class
 O: no class
 B-class: the first element of a sequence of type “class”
 I-class: an element inside a sequence of type “class”
 FEATS: feat1=val1 [TAB] feat2=val2 …
B-NP a=He b=reckons c=the d=He|reckons e=P
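The O / B-class / I-class scheme and the CLASS [TAB] FEATS line layout can be sketched as below. The function names, the (start, end, class) span layout and the annotation of the sample sentence are assumptions made for illustration only.

```python
def bio_labels(n_tokens, spans):
    """Return one O / B-class / I-class label per token.

    spans: list of (start, end, class_name) with end exclusive
    (an assumed layout for this sketch).
    """
    labels = ["O"] * n_tokens
    for start, end, name in spans:
        labels[start] = "B-" + name
        for i in range(start + 1, end):
            labels[i] = "I-" + name
    return labels


def crf_line(label, feats):
    """Format one token as CLASS [TAB] feat1=val1 [TAB] feat2=val2 ..."""
    return "\t".join([label] + [f"{k}={v}" for k, v in feats])


tokens = ["My", "wife", "said", "that", "the", "room", "was", "really", "dirty", "."]
# Hypothetical annotation of the opinion-miner example sentence.
spans = [(0, 2, "Holder"), (4, 6, "Target"), (7, 9, "Negative")]
labels = bio_labels(len(tokens), spans)
for tok, lab in zip(tokens, labels):
    print(crf_line(lab, [("t", tok)]))
```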

Simple Example
Chunks: [NP He] [VP reckons] [NP the current account]
B-NP t=He p=PRP pt=O nt=reckons pp=O np=VBZ
B-VP t=reckons p=VBZ pt=He nt=the pp=PRP np=DT
B-NP t=the p=DT pt=reckons nt=current pp=VBZ np=JJ
I-NP t=current p=JJ pt=the nt=account pp=DT np=NN
I-NP t=account p=NN pt=current nt=O pp=JJ np=O
 We want to train a chunker (also sequences)
 Tagged data
He/PRP reckons/VBZ the/DT current/JJ account/NN
 Features per token:
 token (t), pos (p), previous token (pt), next token (nt), previous pos (pp), next pos (np)
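A minimal sketch of how the six features could be derived from the tagged sentence. The function name is hypothetical, and filling missing neighbours at the sequence edges with "O" is an assumption based on the first example line (pt=O, pp=O).

```python
def context_features(tagged):
    """Build (t, p, pt, nt, pp, np) features for each (token, pos) pair.

    Missing neighbours at the sequence edges are written as "O".
    """
    rows = []
    for i, (tok, pos) in enumerate(tagged):
        prev_tok, prev_pos = tagged[i - 1] if i > 0 else ("O", "O")
        next_tok, next_pos = tagged[i + 1] if i + 1 < len(tagged) else ("O", "O")
        rows.append([
            ("t", tok), ("p", pos),
            ("pt", prev_tok), ("nt", next_tok),
            ("pp", prev_pos), ("np", next_pos),
        ])
    return rows


tagged = [("He", "PRP"), ("reckons", "VBZ"), ("the", "DT"),
          ("current", "JJ"), ("account", "NN")]
chunk_labels = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP"]

# One CRF line per token: label first, then tab-separated features.
for label, feats in zip(chunk_labels, context_features(tagged)):
    print("\t".join([label] + [f"{k}={v}" for k, v in feats]))
```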

My approach
1. Obtain features for each single token
   1. Input: KAF
   2. Output: ‘TAB’ format
   3. Our own customized feature extractor
2. Generate the final set of features (context)
   1. Input: TAB format
   2. Output: ‘CRF’ format
   3. One existing Python script

KAF feature extractor
 Python script that reads a KAF file and generates the
‘TAB’ format
 KafParser + Python script
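A rough sketch of such an extractor, using only the standard library instead of KafParser. The KAF fragment, the element and attribute names used (wf/wid, term/lemma/pos, span/target) and the TAB column order (id, token, lemma, pos) are assumptions for illustration; the real script and the OpeNER KAF files may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced KAF fragment with a text and a terms layer.
KAF_SAMPLE = """<KAF xml:lang="en">
  <text>
    <wf wid="w1" sent="1">The</wf>
    <wf wid="w2" sent="1">room</wf>
    <wf wid="w3" sent="1">was</wf>
    <wf wid="w4" sent="1">dirty</wf>
  </text>
  <terms>
    <term tid="t1" lemma="the" pos="D"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="room" pos="N"><span><target id="w2"/></span></term>
    <term tid="t3" lemma="be" pos="V"><span><target id="w3"/></span></term>
    <term tid="t4" lemma="dirty" pos="G"><span><target id="w4"/></span></term>
  </terms>
</KAF>"""


def kaf_to_tab(kaf_xml):
    """Return one tab-separated line per token: id, token, lemma, pos."""
    root = ET.fromstring(kaf_xml)
    # Map each word-form id to the lemma and POS of the term that spans it.
    term_of = {}
    for term in root.iter("term"):
        for target in term.iter("target"):
            term_of[target.get("id")] = (term.get("lemma"), term.get("pos"))
    lines = []
    for wf in root.iter("wf"):
        # Tokens not covered by any term get "-" placeholders.
        lemma, pos = term_of.get(wf.get("wid"), ("-", "-"))
        lines.append("\t".join([wf.get("wid"), wf.text, lemma, pos]))
    return lines


print("\n".join(kaf_to_tab(KAF_SAMPLE)))
```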

Converting to CRF
 Python script:
 Specify the format of your tab file
 Specify the “templates” (features) for each token
 Run the script on the TAB file to generate the CRF output (OUT)
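The template idea can be sketched as follows. This is not the existing script mentioned in the slides; the function name, the template representation as tuples of (column, offset) pairs, the feature naming such as t[-1]|t[0], and the TAB column order are all assumptions for illustration.

```python
def apply_templates(rows, templates):
    """Expand per-token columns into context features.

    rows: list of dicts, one per token (column name -> value).
    templates: each template is a tuple of (column, offset) pairs;
    more than one pair produces an n-gram feature.
    Templates that reach outside the sequence are skipped for that token.
    """
    out = []
    for i in range(len(rows)):
        feats = []
        for template in templates:
            name = "|".join(f"{col}[{off}]" for col, off in template)
            values = []
            for col, off in template:
                j = i + off
                if not 0 <= j < len(rows):
                    values = None
                    break
                values.append(rows[j][col])
            if values is not None:
                feats.append(f"{name}={'|'.join(values)}")
        out.append(feats)
    return out


TAB = "He\tPRP\tB-NP\nreckons\tVBZ\tB-VP\nthe\tDT\tB-NP"
columns = ["t", "p", "y"]  # format of the TAB file (assumed column order)
rows = [dict(zip(columns, line.split("\t"))) for line in TAB.splitlines()]
templates = [
    (("t", 0),),
    (("p", 0),),
    (("t", -1),),
    (("t", 1),),
    (("t", -1), ("t", 0)),  # token bigram
]
for row, feats in zip(rows, apply_templates(rows, templates)):
    print("\t".join([row["y"]] + feats))
```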

Extracting opinions
 Training
1. Get all KAF files with annotations
2. Obtain TAB file for each file
3. Convert to CRF for each file
4. Create a single training file with all CRF files
5. Train the MODEL with crfsuite
crfsuite learn -m my_model my_data.crf
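Steps 4 and 5 can be sketched like this. The helper names and the two tiny input files are hypothetical; the sketch only builds the argument list for the training call and does not run it, since that requires the crfsuite binary to be installed.

```python
import os
import tempfile


def merge_crf_files(paths, out_path):
    """Concatenate CRF files into one training file.

    A blank line is written after each file so that the last sequence of
    one file is not glued to the first sequence of the next.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                out.write(src.read().rstrip("\n") + "\n\n")


def learn_command(model_path, data_path):
    """Argument list for the crfsuite training call."""
    return ["crfsuite", "learn", "-m", model_path, data_path]


# Tiny demonstration with two hypothetical one-sequence files.
workdir = tempfile.mkdtemp()
paths = []
for name, body in [("a.crf", "B-NP\tt=He\nB-VP\tt=reckons\n"),
                   ("b.crf", "O\tt=Hello\n")]:
    path = os.path.join(workdir, name)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
    paths.append(path)

merged = os.path.join(workdir, "my_data.crf")
merge_crf_files(paths, merged)
print(" ".join(learn_command("my_model", merged)))
# To actually train, pass the list to subprocess.run(..., check=True).
```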

Extracting opinions
 Tagging one KAF file
1. Generate TAB file
One line for each TOKEN (<wf>)
2. Convert to CRF
3. Tag with the trained model
crfsuite tag -m my_model my_kaf.crf
4. Read and align output from crfsuite
5. Generate the KAF layer
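The last steps (reading the tagger output, aligning it with the tokens and building a layer) can be sketched as follows. The sketch assumes the tagger prints one label per line with blank lines between sequences. The function names and the flat XML layout (opinion_element, span, target) are hypothetical and are not the actual OpeNER KAF opinion layer.

```python
import xml.etree.ElementTree as ET


def align(token_ids, tagger_output):
    """Pair token ids with predicted labels, skipping blank separator lines."""
    labels = [line.strip() for line in tagger_output.splitlines() if line.strip()]
    if len(labels) != len(token_ids):
        raise ValueError("tagger output and token list differ in length")
    return list(zip(token_ids, labels))


def decode_bio(pairs):
    """Group B-/I- labels into (class_name, [token ids]) spans.

    An I- label that does not continue a span of the same class is ignored.
    """
    spans, current = [], None
    for tok_id, label in pairs:
        if label.startswith("B-"):
            current = (label[2:], [tok_id])
            spans.append(current)
        elif label.startswith("I-") and current and current[0] == label[2:]:
            current[1].append(tok_id)
        else:
            current = None
    return spans


def opinion_layer(spans):
    """Build an XML layer from the spans (element names are assumptions)."""
    layer = ET.Element("opinions")
    for n, (name, ids) in enumerate(spans, start=1):
        item = ET.SubElement(layer, "opinion_element", {"id": f"o{n}", "type": name})
        span = ET.SubElement(item, "span")
        for tok_id in ids:
            ET.SubElement(span, "target", {"id": tok_id})
    return layer


token_ids = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9", "w10"]
# Hypothetical tagger output for the ten tokens above.
output = "B-Holder\nI-Holder\nO\nO\nB-Target\nI-Target\nO\nB-Negative\nI-Negative\nO\n"
spans = decode_bio(align(token_ids, output))
print(ET.tostring(opinion_layer(spans), encoding="unicode"))
```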

How to adapt this?
1. Adapt the KAF feature extractor (+++)
2. Adapt the TAB-to-CRF converter (+)
3. Train your model (+)
4. Adapt the CRF-to-KAF de-converter (++)
