SAE: Structured Aspect Extraction
1. Meltwater Meetup Budapest - 7 Sep. 2016
Omer Gunes and Tim Furche
Structured Aspect Extraction
Giorgio Orsi
University of Birmingham · University of Oxford
2. Aspect Extraction (AE)
Identifying relevant features of an explicit or implicit entity of interest
The Sony Xperia XZ is the new headliner with top-of-the-line hardware, a bigger
display, a new and improved camera, squared design, and, of course, water-proofing.
Sony Xperia XZ
Entity (explicit) Aspects
new headliner
top-of-the-line hardware
bigger display
new and improved camera
squared design
water-proofing
[Zhang and Liu, 2014]
3. Sentiment Analysis
Aspect (entity) based
The Sony Xperia XZ is the new headliner with top-of-the-line hardware, a bigger
display, a new and improved camera, squared design, and, of course, water-proofing.
Entity: Sony Xperia XZ
Aspects: new headliner, top-of-the-line hardware, bigger display, new and improved camera, squared design, water-proofing
(The slide overlays a sentiment score on each aspect; the visible values are 0.476 for several aspects, 0.641 and 0.350 for others, 0.218 for water-proofing, and 0.341 for "course", picked up from "of course".)
4. Aspect extraction vs attribute extraction
The Sony Xperia XZ is the new headliner with top-of-the-line hardware, a bigger
display, a new and improved camera, squared design, and, of course, water-proofing.
⟨ headliner, yes ⟩
⟨ hardware, top-of-the-line ⟩
⟨ display, { yes, bigger } ⟩
⟨ camera, { yes, new, improved } ⟩
⟨ design, squared ⟩
⟨ water-proofing, yes ⟩
Knowledge Base Construction: basically, you want the attribute (i.e., aspect term) names and factual values
⟨ OEM, Sony ⟩
⟨ model, Xperia XZ ⟩
[Shin et al., 2015]
5. Structured Aspect Extraction (SAE)
Extends AE with fine-grained extraction and typing of complex (i.e., hierarchical) aspects
Aspect term extraction (ATE): Victorian two bedroom mid terrace property
Segmentation: ⟨ { Victorian, ⟨ { two }, bedroom ⟩, mid terrace }, property ⟩
Typing and Generalisation: ⟨ { JJ, ⟨ { CD }, bedroom ⟩, mid terrace }, property ⟩
modifiers = { qualifiers, quantifiers }
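The typing and generalisation step can be sketched as a toy recursive rewrite (all names here are illustrative: the real system derives JJ/CD types from the POS tagger, not from hand-built word lists):

```python
# Toy sketch of typing/generalisation over a segmented aspect.
# Assumptions: numerals become CD (quantifiers); a small stand-in
# adjective lexicon plays the role of real JJ tags (qualifiers).

NUMBER_WORDS = {"one", "two", "three", "four", "five"}
ADJECTIVES = {"victorian", "georgian", "modern"}  # stand-in qualifier lexicon

def type_token(token):
    """Map a modifier token to its generalised type."""
    if token.lower() in NUMBER_WORDS or token.isdigit():
        return "CD"   # quantifier
    if token.lower() in ADJECTIVES:
        return "JJ"   # qualifier
    return token      # multi-words and aspect terms stay as-is

def generalise(segmented):
    """Recursively generalise a nested (modifiers, head) structure."""
    if isinstance(segmented, str):
        return type_token(segmented)
    modifiers, head = segmented
    return ([generalise(m) for m in modifiers], head)

sap = generalise((["Victorian", (["two"], "bedroom"), "mid terrace"], "property"))
print(sap)  # (['JJ', (['CD'], 'bedroom'), 'mid terrace'], 'property')
```

This reproduces the slide's example: the qualifier "Victorian" generalises to JJ and the quantifier "two" to CD, while the nested aspect term "bedroom" and the multi-word "mid terrace" are preserved.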
6. SAE: Why it is hard
Victorian two bedroom mid terrace property located in Cambridge and comprising of
living room with ORIGINAL!!! cupboards, and ORIGINAL!!! picture rail.Stairway off living
room leads to two bedrooms.
Noisy unstructured text (NUT)
(The slide scatters candidate fragments extracted from the noisy text: "bedroom mid terrace", "picture rail.Stairway", "cupboards", "Cambridge", "bedrooms", "ORIGINAL", "property", "Victorian", "room".)
7. SAE: Why it is hard
Noisy unstructured text (NUT)
By the time we get to the dependency parser we have lost the battle already
The problems start with the tokenizer
picture rail.Stairway
Victorian two bedroom mid terrace property located in Cambridge and comprising of
living room with ORIGINAL !!! cupboards, and ORIGINAL !!! picture rail.Stairway off living
room leads to two bedrooms.
and continue with the POS tagger
(The slide overlays POS tags on the sentence — NN, NNP, JJ, CD, VBN, VBG, VBZ, CC — with several of the noisy tokens visibly mis-tagged.)
8. Unsupervised SAE
Large corpus of homogeneous documents (50k ~ 250k)
same domain (use a classifier), preferably no bundles
Normalisation and tagging
tokenisation (NUT specific)
orthography normalisation (most common orthography)
POS tagging (Hepple’s on TreeBank)
NP chunking (Ramshaw–Marcus)
NP Clustering
head noun lemmatization (approx. last noun in NP)
frequent head nouns -> aspect terms
Segmentation
cPMI optimal parsing of an NP -> modifiers / multi-words
Generalisation and typing
structured aspect patterns (SAP)
entity, aspect term, qualifier, quantifier
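The "most common orthography" normalisation mentioned above can be sketched as follows; this is a minimal illustration under the assumption that canonicalisation simply picks each token's most frequent surface form across the corpus:

```python
from collections import Counter

def orthography_map(tokens):
    """For each token (grouped case-insensitively), pick its most
    frequent surface form in the corpus as the canonical spelling."""
    variants = {}
    for tok in tokens:
        variants.setdefault(tok.lower(), Counter())[tok] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in variants.items()}

def normalise(tokens, mapping):
    """Rewrite tokens to their canonical orthography."""
    return [mapping[tok.lower()] for tok in tokens]

corpus = ["Bedroom", "bedroom", "bedroom", "TERRACE", "Terrace", "Terrace"]
mapping = orthography_map(corpus)
print(normalise(["Bedroom", "TERRACE"], mapping))  # ['bedroom', 'Terrace']
```

Here "bedroom" wins over "Bedroom" and "Terrace" over "TERRACE" purely by corpus frequency, which is all this step needs.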
9. NP Clustering
Two further double bedrooms
Three further double bedrooms
A further double bedroom
Two first floor bedrooms
…
Input: A large number of (normalized) NPs
Abstraction of numerical expressions + removal of non-content word prefixes
CD further double bedrooms
CD further double bedrooms
DT further double bedroom
CD first floor bedrooms
{ CC, DT, EX, IN, PRP, PUNC }
Filter head nouns (exp. set but 70-75% of the corpus) and cluster them
Damerau–Levenshtein to compensate for misspellings
{ CD further double bedrooms
further double bedroom
CD first floor bedrooms }
[ bedroom ]
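The clustering step can be sketched in a few lines of Python. This is a simplification of mine, not the authors' implementation: the head noun is approximated by the last token, numerals are abstracted to CD, leading determiners are dropped, and a Damerau-Levenshtein threshold of 1 absorbs misspellings (all of these choices are assumptions).

```python
# Minimal sketch of the NP-clustering step (my simplification, not the paper's code).

NUMBERS = {"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"}
DETERMINERS = {"a", "an", "the"}

def damerau_levenshtein(a, b):
    """Optimal string alignment distance (edits plus adjacent transpositions)."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

def normalise(np_tokens):
    """Drop leading determiners and abstract numeric expressions to CD."""
    toks = [t.lower() for t in np_tokens]
    while toks and toks[0] in DETERMINERS:
        toks.pop(0)
    return ["CD" if t in NUMBERS or t.isdigit() else t for t in toks]

def head_noun(np_tokens):
    """Approximate head-noun lemma: last token, crudely singularised."""
    h = np_tokens[-1].lower()
    return h[:-1] if h.endswith("s") else h

def cluster_by_head(nps, max_dist=1):
    """Group NPs whose head-noun lemmas are within a small edit distance."""
    clusters = []  # (representative head, list of normalised NPs)
    for np_tokens in nps:
        h = head_noun(np_tokens)
        for rep, members in clusters:
            if damerau_levenshtein(h, rep) <= max_dist:
                members.append(normalise(np_tokens))
                break
        else:
            clusters.append((h, [normalise(np_tokens)]))
    return clusters
```

On the slide's examples (plus a misspelled "bedroms"), all NPs fall into a single cluster with representative head "bedroom", as in the [ bedroom ] cluster above.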
10. Segmentation
Victorian two bedroom mid terrace property
Basically, we have to assign each element of the NP's modifier sequence to:
a multi-word expression
an aspect term
find sub-patterns
⟨ Victorian ⟨ two bedroom mid ⟩ ⟨ terrace ⟩ property ⟩
⟨ Victorian ⟨ two bedroom ⟩ ⟨ mid terrace ⟩ property ⟩
⟨ Victorian ⟨ two bedroom ⟩ mid terrace property ⟩
Valid parenthesizations
balanced parenthesization (algorithms and data structures – DP)
for each level k of the parenthesization
we have at least two elements
it either terminates with a head of cluster OR it contains no head of cluster
11. Segmentation
cPMI-optimal parenthesizations
Adaptation of corpus-wide Point-wise Mutual Information (cPMI)
The basis for segmentation is corpus-level significant point-wise mutual information (cPMI) (Damani and Ghonge, 2013). Our definition of cPMI uses the corpus of NPs instead of arbitrary descriptions. Let C be the set of clusters produced as described above. We denote by f_C(t) the frequency of the string t in all clusters of C, i.e., obtained by summing up all of the occurrences of t in all clusters. Let 0 < δ < 1 be the normalization factor defined as in (Damani and Ghonge, 2013), and t‖w the concatenation of two strings t and w. We then define cPMI_C(t, w) as follows:

cPMI_C(t, w) = log( f_C(t‖w) / ( f_C(t) · f_C(w) / |C| + √f_C(t) · √( ln(δ) / −2 ) ) )    (2)

The cPMI value is used to determine whether a token should be associated with (i) the head noun, (ii) a nested token representing the head of a different cluster, thus possibly inducing a nested structure, or (iii) an adjacent token, thus forming a multi-word expression.
⟨ Victorian ⟨ two bedroom ⟩ ⟨ mid terrace ⟩ property ⟩
Parenthesization that maximises cPMInp becomes a (ground) structured aspect pattern (SAP)
⟨ { Victorian, ⟨ { two }, bedroom ⟩, mid terrace }, property ⟩
cPMInp = cPMIC (Victorian, property) + cPMIC (two bedroom, property) +
cPMIC (mid terrace, property) + cPMIC (two, bedroom) + cPMIC (mid, terrace)
[Damani and Ghonge 2013]
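A toy version of cPMI-optimal segmentation can be sketched as follows. The frequency counts, the δ value, and the scoring of a parenthesization as the sum of segment-to-head plus intra-segment cPMI values are illustrative assumptions of mine; only the cPMI formula itself follows the definition above.

```python
import math
from itertools import combinations

# Sketch of cPMI-based segmentation over hypothetical frequencies (toy data).
# cPMI_C(t, w) = log( f(t||w) / ( f(t)*f(w)/|C| + sqrt(f(t)) * sqrt(ln(delta)/-2) ) )

def make_cpmi(freq, n_clusters, delta=0.9):
    penalty = math.sqrt(math.log(delta) / -2.0)
    def cpmi(t, w):
        joint = freq.get(t + " " + w, 0)
        if joint == 0:
            return float("-inf")  # unseen pair: rule this association out
        ft, fw = freq.get(t, 0), freq.get(w, 0)
        return math.log(joint / (ft * fw / n_clusters + math.sqrt(ft) * penalty))
    return cpmi

def segmentations(tokens):
    """All ways to split a token list into contiguous segments (compositions)."""
    n = len(tokens)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = [0, *cuts, n]
            yield [tokens[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def best_segmentation(modifiers, head, cpmi):
    """Pick the segmentation maximising the summed cPMI of (segment, head)
    plus the cPMI of adjacent tokens inside each multi-word segment."""
    def score(seg):
        s = 0.0
        for part in seg:
            s += cpmi(" ".join(part), head)
            for a, b in zip(part, part[1:]):
                s += cpmi(a, b)
        return s
    return max(segmentations(modifiers), key=score)

# Toy frequencies (hypothetical counts, for illustration only).
freq = {
    "victorian": 40, "two": 120, "bedroom": 200, "mid": 60, "terrace": 70,
    "property": 300, "two bedroom": 80, "mid terrace": 55,
    "victorian property": 30, "two bedroom property": 50, "mid terrace property": 40,
}
cpmi = make_cpmi(freq, n_clusters=1000)
best = best_segmentation(["victorian", "two", "bedroom", "mid", "terrace"], "property", cpmi)
```

On these toy counts the only segmentation with a finite score is ⟨ victorian ⟩ ⟨ two bedroom ⟩ ⟨ mid terrace ⟩, matching the slide's winning parenthesization.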
12. Typing and Generalisation
⟨ { Victorian, ⟨ { two }, bedroom ⟩, mid terrace }, property ⟩
Given a (ground) SAP…
Victorian → property-qualifier
two bedroom → property-qualifier
mid terrace → property-qualifier
property → property
two → bedroom-quantifier
bedroom → property
13. Typing and Generalisation
Ground SAPs have good precision but pretty bad recall
POS-based pattern generalization
non-content words are always generalized
aspect terms generalized only if a nested pattern with a ground head exists
qualifiers are generalized one-at-a-time
⟨ { Victorian, ⟨ { two }, bedroom ⟩, mid terrace }, property ⟩
⟨ { JJ, ⟨ { CD }, bedroom ⟩, mid terrace }, property ⟩
⟨ { Victorian, ⟨ { CD }, bedroom ⟩, JJ terrace }, property ⟩
⟨ { Victorian, ⟨ { two }, bedroom ⟩, mid JJ }, property ⟩
⟨ { Victorian, ⟨ { two }, bedroom ⟩, JJ }, property ⟩
⟨ { Victorian, ⟨ { two }, bedroom ⟩, mid terrace }, NN ⟩
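The one-at-a-time generalisation can be illustrated with a minimal sketch. The (token, POS, role) triple representation is mine, and multi-word qualifiers are generalised as a unit here, whereas the slides also generalise their parts:

```python
# Sketch of one-at-a-time POS generalisation (data structures are mine,
# not the paper's): a flat pattern is a list of (token, pos_tag, role) triples.

def generalise_one_at_a_time(pattern):
    """Yield every variant in which exactly one qualifier (or the head)
    is replaced by its POS tag; quantifiers stay ground in this sketch."""
    variants = []
    for i, (tok, pos, role) in enumerate(pattern):
        if role in ("qualifier", "head"):
            variant = list(pattern)
            variant[i] = (pos, pos, role)  # abstract the token to its POS tag
            variants.append(variant)
    return variants

ground = [
    ("Victorian", "JJ", "qualifier"),
    ("two", "CD", "quantifier"),      # quantifiers stay ground here
    ("bedroom", "NN", "aspect"),      # nested aspect term, kept ground here
    ("mid terrace", "JJ", "qualifier"),
    ("property", "NN", "head"),
]
```

Applied to the ground SAP this yields one variant per qualifier plus one for the head, mirroring the JJ / NN substitutions shown above.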
14. No labelled dataset is available. We take the heads of the noun-phrase clusters as a surrogate of the set of valid aspects. The analysis is limited to aspect terms. Let T be the set of valid aspect terms as defined above, and E be the set of aspect terms produced by an SAP P. The score of P is computed as:

ν(P) = Σ_{e ∈ E} 𝟙[ max_{t ∈ T} ( dist(t, e) / len(t) ) < 0.2 ] / ( |T| · log |T| ),   ν(P) ∈ [0, 1]

where dist(t, e) denotes the Damerau-Levenshtein edit distance between two strings t and e, and len(·) denotes the length of the string. Patterns scoring less than an experimentally set threshold are eliminated.
3 Evaluation
Our method (SysName) is implemented in Java. All experiments are run on a Dell OptiPlex 9020 with
two quad-core i7-4770 Intel CPUs at 3.40GHz and 32GB RAM, running Linux Mint 17 Qiana. All
resources used in the evaluation are made available for replicability.2
Datasets and metrics: We use three groups of datasets in our evaluation (Table 1). The first two consist of the SemEval14 and SemEval15 datasets used for the aspect term extraction (ATE) and opinion target expression (OTE) subtasks of the aspect-based sentiment analysis (ABSA) task.
where:
T is the set of reference aspect terms (cluster heads)
dist(t, e) is the Damerau-Levenshtein edit distance
len(·) is the length of the string
Typing and Generalisation
Pattern scoring [Gupta and Manning, 2014]
Score patterns on their ability to discriminate between correct and incorrect extractions
No labelled dataset available → use cluster heads as surrogate labels
Patterns scoring less than an experimentally set threshold are eliminated
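A simplified version of the scoring step can be sketched as follows, using plain Levenshtein as a stand-in for Damerau-Levenshtein and omitting the log |T| normalisation, so it computes only the approximate-match fraction:

```python
# Sketch of pattern scoring (toy strings; the 0.2 relative-edit-distance
# threshold is the one on the slide, the rest is my simplification).

def edit_distance(a, b):
    """Plain Levenshtein distance (a stand-in for Damerau-Levenshtein)."""
    d = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, cb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (ca != cb))
    return d[len(b)]

def score_pattern(extracted, reference):
    """Fraction of reference terms approximately matched by the extractions:
    an extraction e counts if some reference term t has dist(t, e)/len(t) < 0.2."""
    hits = sum(
        1
        for e in extracted
        if any(edit_distance(t, e) / len(t) < 0.2 for t in reference)
    )
    return hits / len(reference)
```

For example, against the reference terms {bedroom, kitchen, garden}, the extractions ["bedrom", "kitchen", "sofa"] score 2/3: the misspelled "bedrom" still matches "bedroom" (relative distance 1/7), while "sofa" matches nothing.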
15. Pattern Matching
Pattern references
nested patterns are not repeated: they reference each other
enables parallel SAP generalisation and matching
⟨ { JJ, #SAPbedroom , mid terrace }, property ⟩
⟨ { Victorian, #SAPbedroom , JJ terrace }, property ⟩
⟨ { Victorian, #SAPbedroom , mid JJ }, property ⟩
⟨ { Victorian, #SAPbedroom , JJ }, property ⟩
SAPproperty
⟨ { Victorian, #SAPbedroom , mid terrace }, NN ⟩
SAPNN
SAPbedroom
⟨ { two }, bedroom⟩
⟨ { CD }, bedroom ⟩
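The reference mechanism can be sketched as a registry keyed by pattern id; the names (#SAP_bedroom, #SAP_property) and the list representation are illustrative, not the actual implementation:

```python
# Sketch of SAP references: nested patterns are stored once and referenced
# by id, so each pattern can be generalised and matched independently.

SAPS = {
    "#SAP_bedroom": [["two", "CD"], "bedroom"],   # ground and generalised slots
    "#SAP_property": [["Victorian", "JJ"], "#SAP_bedroom", ["mid terrace"], "property"],
}

def expand(pattern_id, registry):
    """Recursively inline referenced sub-patterns into a nested list form."""
    out = []
    for slot in registry[pattern_id]:
        if isinstance(slot, str) and slot.startswith("#SAP"):
            out.append(expand(slot, registry))  # follow the reference
        else:
            out.append(slot)
    return out
```

Storing #SAP_bedroom once means its ground and generalised variants can be matched independently of every pattern that embeds it, which is what enables the parallel generalisation and matching mentioned above.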
How fast?
Induction: 10-14 msec / sentence
Matching: 2-3 msec / text
bottlenecks: morphological analysis and cPMI-optimal segmentation
16. Evaluation
Datasets
SemEval OTE/ATE only useful for aspect terms
We provide SAED (Structured Aspect Extraction Dataset - http://bit.ly/2caeXf3)
SAED consists of both NUT and (semi-)formal English texts. We provide GS annotations for 150 texts equally distributed across the six domains. The GS provides an average of 355 aspect terms, 30 quantifiers, 430 qualifiers, and 45 nested aspects per domain. Annotations were produced by 6 independent annotators (κ = 87%). We use standard recall, precision, and F1 score metrics. However, due to the different granularity of the output produced by the systems and of the GS annotations, the definition of a correct extraction varies slightly with each evaluation task.
Table 1: Datasets
DATASET DOMAIN SIZE (#texts) SOURCES CATEGORY FORMALITY TYPE
SemEval14
restaurants 3k + 800 GS (*) Citysearch service NUT evaluative
laptops 3k + 800 GS (*) N/A product NUT evaluative
SemEval15
restaurants 254 + 96 GS Citysearch service NUT evaluative
hotels N/A + 30 GS Citysearch service NUT evaluative
SAED
chairs 94k + 25 GS Amazon, GumTree product NUT descriptive
hotels 20k + 25 GS TripAdvisor service formal descriptive
real estate 87k + 25 GS RightMove product semi-formal descriptive
restaurants 115k + 25 GS TripAdvisor service formal descriptive
shoes 46k + 25 GS Amazon, GumTree product NUT descriptive
watches 10k + 25 GS Amazon, GumTree product NUT descriptive
Comparative evaluation - Simplified SAE: The method by (Kim et al., 2012), henceforth ATL, is currently the closest to SAE we are aware of. We have obtained from the authors the dataset used in their evaluation, but not an implementation of the system.
2: All resources are available at http://bit.ly/29YtM3K and include: the SAED dataset and GS, our reimplementations of IIITH and ATL, a compiled version of SysName, and all output files generated by all systems.
3:
4: http://alt.qcri.org/semeval2015/task12/
Systems
The SemEval 14/15 systems
IIITH [Raju et al., 2009]
ATL [Kim et al., 2012]
ATEX [Zhang and Liu, 2014]
17. Evaluation
ATE setting (SemEval Dataset)
(Figure (a), SemEval14 dataset, Restaurants and Laptops: scores for HIS_RD, DLIREC (U), NRC-Can, UNITOR (U), XRCE, SAP_RI, IITP, SeemGo, ATEX (U), IIITH (U), ATL (U), SysName (U); supervised vs. unsupervised, 0-100 scale)
(Figure (b), SemEval15 dataset, Restaurants and Hotels: scores for ISISLif, LT3 (U), Elixa (U), Sentiue, UFGRS, Wnlp, V3, IIITH (U), ATL (U), ATEX (U), SysName (U); supervised vs. unsupervised, 0-100 scale)
(Figure: R, P, and F1 for IIITH, ATL, ATEX, and SysName on SemEval 2014 and SemEval 2015)
18. Evaluation
Simplified SAE setting (SAE Dataset)
We have reimplemented the ATL method and successfully reproduced the experimental results described in the original paper. Figure 1 shows a comparison between ATL and SysName on the SAED dataset. An extraction is correct if modifiers and aspect terms match exactly the GS annotations, and if modifiers are correctly typed as qualifiers or quantifiers. This is a simplified SAE setting where we do not require correct linking of modifiers to aspect terms.
(Figure 1: SysName vs. ATL on simplified SAE (SAED dataset) - R, P, F1 bars per domain: Chairs, Hotels, Real Estate, Restaurants, Shoes, Watches)
SysName performs 33% better than ATL on average, outperforming it in all domains. Besides being unable to extract hierarchical structures, a visible issue in ATL is the inability to establish and leverage the semantic connection between …
Correct extraction: correct aspect term +
correct modifier +
correct typing for the modifier (i.e., qualifier / quantifier)
19. Evaluation
Full SAE setting (SAE Dataset)
Correct extraction: correct aspect term +
correct modifier +
correct typing for the modifier (i.e., X–quantifier, Y–qualifier) +
correct linking (modifier-entity, sub-patterns)
Figure 2: SysName vs. others in ATE ((c) SAED dataset)
… is indeed a much more challenging task than simply identifying them. Another interesting result is the impact of the generalization on the performance: generalized SAPs produce 444 correct extractions against the 386 of the ground ones (+15%).
(Figure 3: SysName on full SAE - R, P, F1 per domain (Chairs, Hotels, Real Estate, Restaurants, Shoes, Watches) for the ATE, simplified SAE, and full SAE settings)
SAE is substantially harder than ATE/OTE and simplified SAE
20. Evaluation
Effect of corpus size (SAE Dataset)
The larger the corpus… the better?
(Figure 4: Performance vs. corpus size (average - SAED dataset); panels (a) SAE task and (b) ATE task, AVG R / P / F1 at 1%, 5%, 10%, 25%, 50%, and 100% of the corpus)
A breakdown of this experiment by domain for the ATE and SAE tasks allows us to draw further conclusions on the relationship between the size of the corpus and the quality of the SAPs. There is a relationship between the variety of features in a domain and the ability to induce good quality SAPs. For domains such as, e.g., chairs, starting from 25% of the size of the corpus we do not notice substantial improvements. This can be explained by the nature of the features in these domains, such as models of the products, types of real estate properties, etc. In the other domains the texts are much more variegated in features, e.g., restaurant and hotel descriptions.
21. Evaluation
Effect of corpus size (SAE Dataset)
Not necessarily… you often reach a point where more data is not going to help
(Figure: R, P, and F1 vs. corpus size (1%, 5%, 10%, 25%, 50%, 100%) per domain: (a) chairs, (b) hotels, (c) real estate, (d) restaurants, (e) shoes, (f) watches)
22. What’s next
Injecting supervision
Several places: clustering, pattern scoring, and typing are probably the most important ones
Dynamic cut-off thresholds
Use test sets to adjust corpus size and thresholds
Aspects not in NPs
Named entities, relations, other grammatical forms
e.g., living room with sash windows
Automatically determine the domain
Map the NP cluster heads to an existing KB (e.g., BabelNet) and use their graph for scoping
24. References
[Shin et al. 2015] Jaeho Shin, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. 2015. Incremental knowledge base construction using DeepDive. PVLDB, 8(11):1310-1321.
[Raju et al.2009] S. Raju, P. Pingali, and V. Varma. 2009. An unsupervised approach to
product attribute extraction. In Proc. of ECIR, pages 796–800.
[Ramshaw and Mitchell 1999] L. A. Ramshaw and M. P. Mitchell. 1999. Text chunking using transformation-based learning. In S. Armstrong et al., editors, Natural Language Processing Using Very Large Corpora, volume 11 of Text, Speech and Language Technology, pages 157-176.
[Kim et al. 2012] D. S. Kim, K. Verma, and P. Z. Yeh. 2012. Building a lightweight semantic model for unsupervised information extraction on short listings. In Proc. of EMNLP, pages 1081-1092.
[Zhang and Liu2014] Lei Zhang and Bing Liu, 2014. Aspect and Entity Extraction for
Opinion Mining, pages 1–40. Springer Berlin Heidelberg.