Finding Contextual Relationships between Fashion Houses

Madiha Mubin (mamubin@stanford.edu) and Sushant Shankar (sshankar@stanford.edu)
Department of Computer Science, Stanford University
Abstract

Understanding how companies and products are compared online can provide insights for analyzing a market. This paper proposes a method to bootstrap entities (‘players’) in a market and their relationships to each other. Our corpus was a set of high-quality fashion blogs on eyewear. Our system takes this dataset, discovers competitive associations between different fashion houses, and returns weights for each association along with phrases that describe them. This is done in three steps: a) looking for patterns in sentences of a competitive nature, b) using Part-of-Speech tagging and a Named Entity Recognizer to extract singleton and paired entities, and c) traversing the Typed Dependency Graph to extract contextual phrases that describe the competitive association. Of the top 10 fashion houses we extracted, all the relations we caught were actual competitive relationships between fashion houses. Of all long contextual phrases we extracted, 92.1% were accurate and informative.

1 Introduction

Comparative analysis is an important component of natural language understanding. While it makes intuitive sense to compare similar products or competing companies, defining the similarity is tricky. For instance, Apple Inc. and Google Inc. are compared in the mobile industry, but if Google Inc. and Facebook are compared, the comparison is more likely based on their social networking products. In this paper, our goal is to extract information from online blogs and articles to (i) find pairs of competitive companies in the fashion industry and (ii) extract context that gives meaning to those pairs (Figure 1). The notion of context can be extended to geographical location, various demographics, markets, and even descriptors like ‘launched a new line of prescription sunglasses’. For instance, from the sentence ‘In recent news, Gucci’s rival Prada launched a new line of retro sunglasses’, we want to extract [‘Gucci’, ‘Prada’, ‘launched a new line of retro sunglasses’].

Figure 1: Shows different types of context (e.g. demographic and geographic) that can be associated between two companies about a certain product.

In order to narrow the scope of the problem, we focused on eyewear; however, our method can generalize to other products from the fashion world.

2 Related Work

Recently, a weakly supervised method was proposed to learn comparative questions and extract comparators simultaneously (Li et al., 2010). The process begins with identifying an Indicative Extraction Pattern (IEP), a sequence that can be used to identify other comparative questions. From this initial seed, more comparative pairs are identified.
For each comparator pair extracted, all the questions containing that pair are identified, which allows recognition of more comparator pairs. The comparative questions and patterns are scored for reliability, and those deemed reliable are stored for use as seeds to improve performance over time. Comparator patterns are generated using language rules that account for lexically generalized and specialized patterns. The reliability of each pattern at every iteration is computed as a weighted average of the pattern’s performance so far, in terms of the number of questions it has extracted, and a look-ahead reliability score, which helps reduce the problem of underestimation due to incomplete knowledge at a given iteration.

3 Methodology

In this paper, we augmented the comparator extraction logic in three ways. First, we expanded our extraction process to deal with regular sentences as well as questions that exhibited a competitive or comparative nature. Second, we used both entities in each pair extracted from comparative sentences to gather more information about them and inform our pipeline. Finally, using singletons and pairs of entities, we devised an algorithm to extract context from the sentences containing those entities.

For this project, we accumulated a set of approximately 300 fashion-related articles by manually searching the web. The main rationale behind using fashion blogs was to find articles that explicitly talked about fashion house rivalries; one such example is http://www.christian-dior-glasses.com/articles/. These articles were blogs and reports of trends and changes in the fashion industry pertaining to eyewear from 2010 to the present. Using this data source not only allowed us to extract good quality entity pairs, but also helped us validate our content extraction pipeline, as it was extracting temporally relevant information.

Figure 2: Shows our pipeline starting from high-quality relation-rich blogs to extract patterns and using those as seeds to scrape more data which might not contain as much of the desired content.

3.1 Pattern Generation for Extraction

We focused on relations between fashion houses of a competitive nature. We looked for paragraphs in our corpus that had words with prefixes ‘rival’ or ‘compet’. We considered looking only for sentences which contained these words, but felt that the sentences around them could also contain relations and contexts of a competitive nature.

3.2 Entity Pair Extraction

3.2.1 Relation-Rich Tier (Pairs)

We start with our dataset to extract reliable patterns of the type (CompanyX, CompanyY, [Context]) that exhibit competitive behavior.

Once we identified our target paragraphs, we used the Stanford CoreNLP parser to extract Parts-of-Speech tags, Named Entity Recognition information, and collapsed typed dependency tree structures for each sentence (Toutanova, 2003; de Marneffe, 2006). Our entity extraction method simply looked for POS NNP tokens that had NER tags ‘ORGANIZATION’ or ‘PERSON’. Many of the entities we caught were fashion houses, but not all. For instance, we caught entities like ‘Tom Cruise’ from the sentence ‘Tom Cruise is a fan of Ray-Ban sunglasses’ or ‘Apple’ from the sentence ‘Apple was another attendee in an event with a fashion house’.
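This token-level filter is straightforward to sketch. The (word, POS, NER) triple format and the function name below are hypothetical stand-ins for whatever a CoreNLP-style tagger actually emits, not the paper’s implementation:

```python
def extract_entities(tagged):
    """Collect maximal runs of NNP tokens whose NER tag is
    ORGANIZATION or PERSON. Input is a list of hypothetical
    (word, pos, ner) triples, as a CoreNLP-style tagger might produce."""
    entities, current = [], []
    for word, pos, ner in tagged:
        if pos == "NNP" and ner in ("ORGANIZATION", "PERSON"):
            current.append(word)          # extend the current entity span
        else:
            if current:                   # a non-entity token closes the span
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tagged = [("Tom", "NNP", "PERSON"), ("Cruise", "NNP", "PERSON"),
          ("is", "VBZ", "O"), ("a", "DT", "O"), ("fan", "NN", "O"),
          ("of", "IN", "O"), ("Ray-Ban", "NNP", "ORGANIZATION"),
          ("sunglasses", "NNS", "O")]
# extract_entities(tagged) → ["Tom Cruise", "Ray-Ban"]
```

Note that, exactly as in the Tom Cruise example above, this filter keeps PERSON spans too; distinguishing fashion houses from celebrities requires the later frequency-based scoring.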
We noticed a common style in our dataset. Often after the entity, a sentence contained a further specification of the line or type of product being compared or talked about, for instance, ‘Prada sunglasses’. To extract this, we looked for POS NNS tags after any of the entities found. For instance, one of our patterns looked like: NNP (ORG) – NNS* – NNP (ORG). As we were looking only at fashion eyewear sites, these NNS tags tended to denote ‘eyeglasses’ or ‘glasses’.

We hypothesized that for sentences containing more than one entity, all of the entities could be compared or grouped in some way with each other. Therefore, sentences with more than a pair of entities gave us more entities to compare. For instance, if we saw three entities (x, y, z), then we generated relations (x, y), (x, z) and (y, z), i.e. we generated all possible pairs from the list of entities. For example, from the sentence ‘Additionally, another Christian Dior glasses competitor Marchon Eyewear renewed its agreement with NIKE.’, we generated three relations: (‘Christian Dior’, ‘Marchon Eyewear’), (‘Christian Dior’, ‘Nike’) and (‘Marchon Eyewear’, ‘Nike’).
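This pair expansion is simply the set of unordered 2-combinations of the entity list; a minimal sketch:

```python
from itertools import combinations

def entity_pairs(entities):
    """Generate every unordered pair of co-occurring entities,
    as in the three-entity example above."""
    return list(combinations(entities, 2))

pairs = entity_pairs(["Christian Dior", "Marchon Eyewear", "Nike"])
# → [("Christian Dior", "Marchon Eyewear"),
#    ("Christian Dior", "Nike"),
#    ("Marchon Eyewear", "Nike")]
```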
3.2.2 Relation-Poor Tier (Singletons)

Most sentences that talk about a fashion house or a new line will not contain its rival or competition. However, if more is known about the context of that rival or competitor in isolation, it can help us understand the context in which it is being compared to another company or line. When there is no more than one entity in a sentence, we call it a singleton: the sentence may still carry contextual information about that entity.

3.3 Context Extraction

The context extraction method described in this paper relies on the key assumption that sentences containing multiple entities as well as context typically introduce the entities in the first part of the sentence and the context in the second part. Consider the sentence: ‘This month Christian Dior glasses competitor Marchon released a line aiming for the teen market’. Here the entities identified are ‘Christian Dior’ and ‘Marchon’, and the product being compared is ‘glasses’. As depicted in Figure 3, the entities are introduced in the left subtree of the typed dependency graph and the context is within the right subtree. This pattern was identified by manually analyzing a sample of sentences and by understanding the style of the articles in our corpus.

Figure 3: Shows a collapsed typed dependency tree for the sentence ‘This month Christian Dior glasses competitor Marchon released a line aiming for the teen market’.

Following this assumption, we designed an algorithm (Algorithm 1) for context extraction given a pair of entities previously identified in the sentence by our specialized patterns.

Algorithm 1 Extracting Context From Sentence S
GetContext(E: entities in S, C: typed dependency list for S):
• T ← construct a tree from C.
• Traverse from the leftmost subtree, discarding any subtree t such that there is an e ∈ E found in t.
• If there are subtrees left:
  – Use breadth-first search to assemble the context of the sentence, starting from the first subtree after all the entities in E have been visited.
  – R ← concatenate the words, ignoring ‘cop’ and ‘det’ relations for compression, and recheck the ordering of the words against the actual sentence.
  – return R
• else
  – return None

Figure 4 shows a run of Algorithm 1. The leftmost subtree is ignored because it contains entities, and the right subtree is traversed to obtain context. As the subtree is traversed, words within ‘det’ or ‘cop’ dependencies are ignored. This reduced the number of traversals without altering the context of the sentence and was especially useful for longer sentences.
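A compact Python rendering of Algorithm 1 follows. The node representation is an assumption for illustration (each node a (word, sentence index, dependency relation, children) tuple built from the collapsed typed dependency list, with entities matched as single tokens); none of these names come from the paper’s implementation:

```python
from collections import deque

def get_context(entities, subtrees):
    """Sketch of Algorithm 1. `subtrees` are the root's children in
    left-to-right sentence order; each node is a hypothetical
    (word, index, relation, children) tuple."""
    def collect(node):
        word, idx, rel, kids = node
        yield word, idx, rel
        for kid in kids:
            yield from collect(kid)

    seen, kept = set(), []
    for sub in subtrees:
        hit = {w for w, _, _ in collect(sub)} & set(entities)
        if hit:
            seen |= hit            # subtree contains an entity: discard it
        elif seen == set(entities):
            kept.append(sub)       # past all entities: part of the context
    if not kept:
        return None

    # Breadth-first assembly, dropping 'det' and 'cop' words for compression.
    words, queue = [], deque(kept)
    while queue:
        word, idx, rel, kids = queue.popleft()
        if rel not in ("det", "cop"):
            words.append((idx, word))
        queue.extend(kids)
    # Recheck word order against the actual sentence positions.
    return " ".join(w for _, w in sorted(words))

# Toy fragment of the Marchon example (indices are sentence positions).
entity_sub = ("Marchon", 5, "nsubj", [("Dior", 3, "nn", [])])
context_sub = ("line", 8, "dobj",
               [("a", 7, "det", []),
                ("aiming", 9, "partmod",
                 [("market", 12, "prep_for",
                   [("the", 10, "det", []), ("teen", 11, "amod", [])])])])
# get_context(["Marchon"], [entity_sub, context_sub])
# → 'line aiming teen market'
```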
Figure 4: Shows the typed dependency tree that Algorithm 1 traverses. Nodes circled red are entities and those circled blue are part of the context.

While a majority of sentences in our corpus followed the assumption we specified earlier, a few did not. For instance, Figure 5 shows an example where the entity ‘Kate Spade’ appears at the end of the sentence, causing the algorithm to miss the context completely. We discuss the limitations and possible improvements to our algorithm in the discussion section.

Figure 5: Shows how the sentence ‘Now that time has passed, Bebe is a now famous west coast based company that rivals European companies such as Kate Spade eyeglasses’ violated the key assumption.

4 Evaluation

4.1 Paired Relations

As such, there is no gold standard against which to evaluate relations. However, we devised a metric that captures whether we think a relation is saying something of significance. Since we have counts of the occurrence of each entity individually and of its co-occurrences with each other entity, we decided that Pointwise Mutual Information with contextual rescaling was a metric that could capture how often an entity is seen with another entity while taking into account how often it is seen by itself.

Let P(e_i) be the probability of seeing entity e_i and P(e_i, e_j) be the probability of seeing entities e_i and e_j in a pair. We had an n × n frequency matrix f of fashion houses, so f_ij represents the number of times e_i appeared with e_j in our corpus.

• pmi_ij = log [ P(e_i, e_j) / ( P(e_i) P(e_j) ) ]   (assume log 0 = 0)

• scaledpmi_ij = pmi_ij × f_ij / (f_ij + 1) × min(Σ_{k=1}^{n} f_kj, Σ_{k=1}^{n} f_ik) / [ min(Σ_{k=1}^{n} f_kj, Σ_{k=1}^{n} f_ik) + 1 ]

Figure 6 shows these adjusted PMI values for all 169 of our entities, and Figure 7 shows them for the top 10 entities (in terms of the number of mentions). Note that the PMI values are lower for the top entities than for others: for the less-mentioned entities, the times they are mentioned with another entity make up a much higher share of the total times they are mentioned.

We then set about manually checking the top 10 relations. To do this, we chose a PMI threshold of 0.3 above which to consider relations. We can generate a graph of relations using this threshold; see Figure 8. Checking these relations, 100% were corroborated by another source (i.e., the companies were related and competed or worked together). Our relation extraction is of course restricted by our corpus and amount of data: there are many relations between fashion houses (not to mention other entities) that we do not catch because they are elsewhere on the web or unwritten.
Figure 6: PMI with contextual discounting for all fashion houses × fashion houses. The fashion houses are ranked by the number of occurrences. Note that the top fashion houses seem to be compared with all other fashion houses; this is expected (the top fashion houses are seen as rivals by many other fashion houses, are mentioned more, and so have more customers, suppliers, etc.).

Figure 7: PMI with contextual discounting for the top 10 fashion houses.

The signature of ‘Marchon’ is particularly striking. We investigated this trend using other blogs and newspaper sources and found that Marchon is a supplier of eyewear to some of the top fashion houses as well as a competitor to others.

Figure 8: Shows the relationships between the top 10 fashion houses. These entities were connected if their PMI score is greater than or equal to 0.3.

Figure 6 also shows some interesting relationships between less frequently occurring fashion houses. Using a higher threshold (> 1), we were able to create a landscape of relationships between those fashion houses as well.

Figure 9: Shows relationships between fashion houses that were not in the top 10. These had higher PMI values and so depict relationships with higher confidence.

4.2 Context

For many of the edges in the graph, we were able to catch contextual phrases. Table 1 shows some contextual phrases that we do catch, and Table 2 shows phrases that actually have the wrong meaning. For Table 2, in the first case the sentence was ‘In recent Prada eyeglasses news, rival companies Altair and Tommy Bahama will be teaming up to create eyewear for the upcoming 10 years’, and in the second it was ‘Lastly, Kate Spade eyeglasses competitors Altair and Tommy Bahama grew their current partnership for creation eyeglass frames for a long time’. The sentence structures we miss are often more complex than our algorithm can handle: such a sentence talks about entity x’s rival entity y partnering with entity z, and it is not right to say that x partners with z.

To evaluate the accuracy of our contexts, we first look at unique contexts. Because many contexts are short, we have many repetitions: 181 unique contexts out of 429 total contexts. We found that none of the contexts shorter than three words were useful.
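The deduplication and length cutoff just described amount to a few lines; the sample phrases below are invented for illustration:

```python
def long_unique_contexts(contexts, min_words=3):
    """Deduplicate extracted phrases, then keep those of at least
    `min_words` words (shorter contexts proved uninformative)."""
    unique = set(contexts)
    return sorted(c for c in unique if len(c.split()) >= min_words)

sample = ["new line", "new line",
          "released line aiming for teen market",
          "just launched iWear"]
# long_unique_contexts(sample)
# → ['just launched iWear', 'released line aiming for teen market']
```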
Entity 1            Entity 2             PMI   Contextual Phrase
Ralph Lauren        Michael Bastian      0.35  ‘Held event Monday promote his upcoming designer rimless line’
Marcolin Group      Dolomiti             1.13  ‘released their new round eyeglasses’
Marcolin Group      Kenneth Cole Prod.   1.20  ‘announced their licensing agreement for distribution of glasses’
RayBan              Charmant Group       0.71  ‘told the press that a new vintage eyeglasses line was released’
Christian Dior S.   Filo Group           1.54  ‘are partnering online sell their designer eyeglasses’
Prada               Marchon              0.74  ‘released line aiming for teen market’
Marchon             Prada                0.74  ‘just launched iWear’

Table 1: Positive Contextual Phrases extracted.

Entity 1            Entity 2             PMI   Contextual Phrase
Prada               Tommy Bahama         0.45  ‘teaming for up create for eyewear for the upcoming 10 years’
Kate Spade          Tommy Bahama         0.54  ‘grew their with current partnership creation eyeglass frames long time’

Table 2: Negative Contextual Phrases extracted.
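Rows like those in Tables 1 and 2 can be screened the same way the relation graphs were built, keeping pairs whose PMI clears a threshold. The tuple layout and the third row’s score here are assumptions for illustration:

```python
def strong_relations(rows, threshold=0.3):
    """Keep (entity1, entity2, pmi, phrase) rows whose PMI score
    meets the threshold used for the top-10 relation graph."""
    return [r for r in rows if r[2] >= threshold]

rows = [("Prada", "Marchon", 0.74, "released line aiming for teen market"),
        ("Prada", "Tommy Bahama", 0.45, "teaming up to create eyewear"),
        ("Gucci", "Prada", 0.12, "rival launched retro sunglasses")]
# strong_relations(rows) keeps the first two rows only.
```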
It’s interesting to note that the word ‘donate’ occurred 18 times, the words ‘launched’ and ‘renewed’ 12 times each, and the word ‘agreed’ 10 times. Some of these sentences did not elaborate further contextually, but others are contextual phrases that our method does not catch.

We found 183 contexts (42.7% of total contexts) and 102 unique contexts (56.4% of total unique contexts) that are longer than two words. Out of these 102 unique contexts, we found three (2.9%) that really did not carry much information at all (our method caught the right phrase, but the phrase itself meant little, a reflection of the sentence) and five (4.9%) that were inaccurate (indicating problems with our method). This means that 92.1% of our contexts were both accurate and informative. Of the inaccurate ones, two resulted from not completely putting prepositions back into the context (as that would require more graph traversals); three resulted from sentences of a more complex structure.

5 Conclusion

We have proposed a method to discover entities and relationships between fashion houses, together with a method to provide contextual phrases describing those relationships. This was done by: a) looking for patterns in sentences of a competitive nature, b) using Part-of-Speech tagging and Named Entity Recognition to extract singleton and paired entities, and c) traversing the Typed Dependency Graph to extract contextual phrases that describe the competitive association. We presented an evaluation metric that uses Pointwise Mutual Information with contextual discounting to identify important or ‘surprising’ relationships, along with graph representations of the relationships between entities. We have also shown that our context extraction system is quite accurate: of all long contextual phrases that we extracted, 92.1% were accurate and informative. This method can be applied to bootstrap and learn entities and relationships in any market, given an appropriate corpus.

6 Future Work

6.1 Improve contextual phrase extraction

As pointed out in our evaluation, we miss phrases that have a more complex sentence structure. In addition, we do not catch sentences where the context is mentioned before the entities (this would require a simple modification of our algorithm in which we start at the end instead of the beginning).

Our contextual phrase extraction also sometimes identifies phrases that are common to multiple entities, but often has phrases that are specific to one entity; our method does not disambiguate these.

6.2 Filtering contexts

Our method finds phrases for contexts that do not provide much information. This is partially due to a method that is too simple and needs more rules for different structures of sentences and contexts.
However, it is largely because most sentences do not contain relevant information and do not say much. There needs to be a method to filter the contexts that are possibly relevant. A classifier can be built using features such as the length of the phrase, the number of modifiers in the context (as a proxy for how specific the context is), and other features developed to separate important contexts from less informative ones.

6.3 Types of relationships

We believe one of the most exciting contributions of the paper is the rich set of relations and contexts that we are able to generate from a relatively small corpus with just two seed prefixes to search for (‘rival’ and ‘compet’). If a more extensive lexicon is developed for this type of relation, we believe we can catch even more relations and contexts. Additionally, this paper is focused on catching relationships of a competitive or rival nature. What we found is that since we are looking for sentences around sentences that contain ‘rival’ or ‘compete’, we also catch relationships that are partnerships (licensing agreements, for example), or sentences that contain multiple relationships (i.e., ‘Rival of x, y, is partnering with z’). We can similarly create lexicons for other types of relationships, such as partnerships, suppliers, customers, or other relationships between companies and their products.

6.4 Obtaining more data

While these high quality fashion blogs are limited in number, they give us a starting point by providing a set of high confidence relation triples. In the future, we hope to augment our corpus by adding fashion-related articles from reliable newspapers. Most fashion houses compete over more than one product, and by adding information for other products we hope to recover more interesting competitive relationships.

Acknowledgments

We would like to thank Christopher Potts for helpful discussions, constructive feedback, and the Typed Dependency visualization code.

References

Shasha Li et al. 2010. Comparable Entity Mining from Comparative Questions. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics: 650–658.

Advaith Siddharthan. 2011. Text Simplification Using Typed Dependencies: A Comparison of the Robustness of Different Strategies. In Proceedings of the 13th European Workshop on Natural Language Generation (ENLG): 2–11.

Kristina Toutanova and Christopher D. Manning. 2000. Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC-2000): 63–70.

Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003: 252–259.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. In LREC 2006.

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005): 363–370.