Authors: Anastasia Zhukova, Felix Hamborg, Karsten Donnay, Bela Gipp
Publication date: 2021/02/26
Conference: Diversity, Divergence, Dialogue: 16th International Conference, iConference 2021, Beijing, China, March 17–31, 2021, Proceedings, Part I 16
Pages: 514-526
Publisher: Springer International Publishing
Abstract: Unsupervised concept identification through clustering, i.e., the identification of semantically related words and phrases, is a common approach to identifying contextual primitives employed in various use cases, e.g., text dimension reduction (replacing words with concepts to reduce the vocabulary size), summarization, and named entity resolution. We demonstrate the first results of an unsupervised approach for identifying groups of persons as actors extracted from a set of related articles. Specifically, the approach clusters mentions of groups of persons that act as non-named entity actors in the texts, e.g., “migrant families” and “asylum-seekers.” Compared to our baseline, the approach keeps the mentions of the geopolitical entities separated, e.g., “Iran leaders” and “European leaders,” and clusters (in)directly related mentions with diverse wording, e.g., “American officials” and “Trump Administration.”
https://www.gipp.com/wp-content/papercite-data/pdf/zhukova2021.pdf
Concept Identification of Directly and Indirectly Related Mentions Referring to Groups of Persons
1. Concept Identification of Directly and Indirectly Related Mentions Referring to Groups of Persons
Anastasia Zhukova (1), Felix Hamborg (2, 4), Karsten Donnay (3, 4), Bela Gipp (1, 4)
(1) University of Wuppertal, Germany
(2) University of Konstanz, Germany
(3) University of Zurich, Switzerland
(4) Heidelberg Academy of Sciences and Humanities, Germany
07 February 2023
2. Motivation
07 February 2023 2
A. Zhukova et al. "Concept Identification of Directly and Indirectly Related Mentions Referring to Groups of Persons"
Migrant caravan of asylum seekers reaches U.S. border. By Sunday afternoon, a group of about 150 of the
caravan's members had begun crossing into the U.S.
People who request protection at a United States entry point must be referred to an asylum officer for a
screening. Asylum-seekers are typically held up to three days at the border.
The human stakes for the individual migrants planning to seek asylum Sunday were at least as high. About
80 U.S. families have also offered to sponsor migrants seeking asylum. Lawyers who went to Tijuana
denied any coaching of the roughly 400 people in the caravan.
Central American migrants and supporters of the migrant caravan from the U.S. side looking south into
Mexico on April 29, 2018.
The message was intended as a show of support for the Central American transgender women seeking
asylum.
These migrants will decide whether to present themselves to U.S. Border officers at the San Ysidro port of
entry and apply for asylum.
3. Motivation
Named Entity Recognition & Coreference Resolution?
4. Motivation
Concept extraction: groups of persons
5. Research question
How to automatically identify conceptually fine-grained clusters of related mentions referring to groups of people in an unsupervised way?
Direct mentions (same group of people):
  demonstrators
  28,000 attendees of the demonstration
  Mr. Trump’s critics
  people opposing Trump’s visit
Indirect mentions (associated with geo-political entity or organization):
  White House officials
  Trump Administration
  U.S. officials
  American diplomats
6. Pipeline
Main principles
• OPTICS: merge points in order of decreasing density
• Hierarchical clustering: aggregate linkage criteria to merge clusters
Stages: Cluster cores → Cluster bodies → Border mentions and non-core clusters → Merge clusters
7. Preprocessing
[Figure: named-entity grid over the terms GOP, Republicans, United_States, U.S., American, Americans, Spanish, Mexico]
- extract noun phrases
- keep only headwords and adjective, compound, noun, and number modifiers
- vectorize words with word2vec
- construct a named-entity grid
- vectorize each phrase as the weighted average of its word vectors
- similarity = cosine similarity
[Figure: unweighted phrase vectors for Americans, U.S., citizens, U.S. + citizens, Russian, Russian + citizens, Russians. Difference and similarity between phrases are hard to distinguish.]
[Figure: weighted phrase vectors for 2 × Americans, 2 × U.S. + citizens, 2 × Russian + citizens, 2 × Russians, shown with Americans, U.S., Russian, Russians, citizens. Related phrases are easier to identify.]
Example of a reduced phrase: “people from migrant caravan” → “people migrant caravan”
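The effect of weighting can be illustrated with a minimal sketch. It assumes toy 3-dimensional vectors in place of real word2vec embeddings and a weight of 2 for the named-entity words, as in the figure; the names TOY_VECTORS, phrase_vector, and cosine are hypothetical and not taken from the authors' implementation.

```python
import math

# Toy stand-ins for word2vec embeddings (assumption: real vectors are much larger).
TOY_VECTORS = {
    "U.S.": [1.0, 0.0, 0.0],
    "Russian": [0.0, 1.0, 0.0],
    "citizens": [0.0, 0.0, 1.0],
}

def phrase_vector(words, weights=None):
    """Weighted average of the word vectors; unlisted words get weight 1.0."""
    weights = weights or {}
    total = [0.0, 0.0, 0.0]
    weight_sum = 0.0
    for word in words:
        w = weights.get(word, 1.0)
        for i, x in enumerate(TOY_VECTORS[word]):
            total[i] += w * x
        weight_sum += w
    return [x / weight_sum for x in total]

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

us_citizens = ["U.S.", "citizens"]
russian_citizens = ["Russian", "citizens"]
ne_weights = {"U.S.": 2.0, "Russian": 2.0}  # hypothetical named-entity weights

plain = cosine(phrase_vector(us_citizens), phrase_vector(russian_citizens))
weighted = cosine(phrase_vector(us_citizens, ne_weights),
                  phrase_vector(russian_citizens, ne_weights))
print(plain, weighted)
```

With these toy vectors, the shared word "citizens" dominates the unweighted similarity, while the weighted variant pushes the two phrases apart, which is the behavior the second figure describes.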
8. Cluster cores
Core phrases form distinctive initial clusters:
1) A and B are similar to each other (A ~ B)
2) A and B are similar to a sufficient number of other phrases, e.g., A ~ C, D, E, G and B ~ C, D, E, F, H
3) Assemble similarity chains
[Figure: core mentions, specializing mentions, and generalizing mentions, illustrated with “Republican establishment”, “GOP leaders, Republicans”, and “a Republican attorney general”]
Similarity chain: A ~ B, B ~ C → {A, B, C}
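The core-building steps can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the similarity relation has already been thresholded into a neighbors mapping, interprets "a sufficient number of other phrases" as a count of shared similar phrases (min_shared, a hypothetical parameter), and uses union-find to assemble the similarity chains.

```python
from itertools import combinations

def find_cores(neighbors, min_shared):
    """neighbors: phrase -> set of phrases considered similar to it."""
    parent = {p: p for p in neighbors}

    def find(p):
        # Follow parent links to the representative, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    linked = set()
    for a, b in combinations(sorted(neighbors), 2):
        # Step 1: A and B are similar to each other.
        if b in neighbors[a] and a in neighbors[b]:
            # Step 2 (assumed reading): enough other phrases similar to both.
            shared = (neighbors[a] & neighbors[b]) - {a, b}
            if len(shared) >= min_shared:
                # Step 3: chaining, A ~ B and B ~ C end up in one set.
                parent[find(a)] = find(b)
                linked.update((a, b))

    clusters = {}
    for p in linked:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(c) for c in clusters.values())

# The letters follow the slide: A ~ C, D, E, G and B ~ C, D, E, F, H.
neighbors = {
    "A": {"B", "C", "D", "E", "G"},
    "B": {"A", "C", "D", "E", "F", "H"},
    "C": {"A", "B"}, "D": {"A", "B"}, "E": {"A", "B"},
    "F": {"B"}, "G": {"A"}, "H": {"B"},
}
cores = find_cores(neighbors, min_shared=3)
print(cores)
```

Here only A and B share three similar phrases (C, D, E), so they form the single core; the remaining phrases are left for the later body and border stages.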
9. Bodies and borders
Body phrases:
similar to at least 1 core phrase
Resolve conflicting terms:
assign to the core cluster whose phrases are most similar
Border phrases:
similar to at least 2 clustered phrases
Resolve conflicts:
assign to the cluster with the most similar clustered terms
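The body-assignment rule can be sketched as below. The sketch covers body phrases only and makes two assumptions that the slides do not specify: conflicts are resolved by the mean similarity to a cluster's core phrases, and a word-overlap score (word_overlap) stands in for the cosine similarity of phrase vectors. All function and parameter names are hypothetical.

```python
def word_overlap(p, q):
    """Toy similarity: Jaccard overlap of lowercased words (stand-in for cosine)."""
    a, b = set(p.lower().split()), set(q.lower().split())
    return len(a & b) / len(a | b)

def assign_bodies(cores, candidates, sim, threshold):
    """cores: cluster name -> set of core phrases. Returns the extended clusters."""
    clusters = {name: set(phrases) for name, phrases in cores.items()}
    for phrase in candidates:
        best_name, best_score = None, 0.0
        for name, core_phrases in cores.items():
            scores = [sim(phrase, core) for core in core_phrases]
            # Body condition: similar to at least one core phrase.
            if max(scores) >= threshold:
                mean = sum(scores) / len(scores)
                # Conflict resolution (assumed): highest mean similarity wins.
                if best_name is None or mean > best_score:
                    best_name, best_score = name, mean
        if best_name is not None:
            clusters[best_name].add(phrase)
    return clusters

cores = {
    "republicans": {"GOP leaders", "Republican leaders"},
    "migrants": {"migrant families", "migrant caravan"},
}
candidates = ["Republican establishment", "migrant children", "smugglers"]
clusters = assign_bodies(cores, candidates, word_overlap, threshold=0.3)
print(clusters)
```

Phrases that match no core, such as "smugglers" in this toy input, stay unclustered and would be candidates for the border and non-core stages.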
10. Non-core and merge clusters
[Figure: remaining phrases form a cluster; candidate clusters are checked for merging]
Non-core clusters:
at least 2 phrases are similar
The remaining unclustered phrases form a cluster if they are similar to each other
Merge clusters:
clusters are semantically similar
The semantic similarity of the clusters’ TF-IDF-weighted phrases exceeds a threshold
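One way to read the merge criterion is sketched below. It assumes that each cluster is represented by the TF-IDF-weighted sum of its word vectors and that two clusters merge when the cosine similarity of these representations reaches a threshold. The toy 2-dimensional vectors, the smoothed IDF formula, and all names are illustrative assumptions rather than the authors' settings.

```python
import math
from collections import Counter

# Toy stand-ins for word embeddings (assumption).
TOY_VECTORS = {
    "migrants": [1.0, 0.1], "immigrants": [0.9, 0.2], "refugees": [0.8, 0.3],
    "officials": [0.1, 1.0], "leaders": [0.2, 0.9],
}

def tfidf_weights(clusters):
    """clusters: list of word lists. Returns one {word: tf-idf} dict per cluster."""
    n = len(clusters)
    df = Counter(word for words in clusters for word in set(words))
    weights = []
    for words in clusters:
        tf = Counter(words)
        # Smoothed IDF (assumption) so that shared words keep a non-zero weight.
        weights.append({w: (tf[w] / len(words)) * (math.log(n / df[w]) + 1.0)
                        for w in tf})
    return weights

def cluster_vector(weights):
    """TF-IDF-weighted sum of the toy word vectors."""
    vec = [0.0, 0.0]
    for word, weight in weights.items():
        for i, x in enumerate(TOY_VECTORS[word]):
            vec[i] += weight * x
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def should_merge(i, j, clusters, threshold):
    """True if clusters i and j are similar enough to be merged."""
    weights = tfidf_weights(clusters)
    return cosine(cluster_vector(weights[i]), cluster_vector(weights[j])) >= threshold

clusters = [
    ["migrants", "migrants", "refugees"],
    ["immigrants", "refugees"],
    ["officials", "leaders"],
]
print(should_merge(0, 1, clusters, 0.8), should_merge(0, 2, clusters, 0.8))
```

The threshold of 0.8 is chosen only to separate the toy clusters; a real value would need to be tuned on data.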
11. Qualitative evaluation: indirect mentions
Our approach
Title: Phrases
  Americans: Former intelligence officials, American officials, White House officials, outside experts, Officials, Trump administration, intelligence community, officials, administration
  Iranians: brutal regime, Iran leaders, exhaustive regimes, inspectors, inspection regime, Iranian regime, regime
  Israelis: senior Israeli official, Israelis, Israeli networks, Israeli leader, Israeli officials
  Europeans: Europeans, European leaders

Hierarchical clustering
Title: Phrases
  officials: American officials, White House officials, outside experts, Officials, officials, Israeli officials
  regime: administration, brutal regime, exhaustive regimes, Iranian regime, regime
  leaders: Iran leaders, Israeli leader, European leaders
  ?: senior Israeli official, Israelis, Europeans
12. Qualitative evaluation: direct mentions
Hierarchical clustering Our approach
Phrases (one cluster):
  Central American migrants, asylum-seekers, Similar migrant groups, Central Americans, gay migrants, American sponsors, Central American children, several American advocacy groups, Asylum-seeking immigrant, Central American transgender women, refugees, undocumented immigrants, immigrant rights activists, Asylum-seekers, individuals, queer, migrant families, legitimate asylum-seekers, Migrant caravan, migrants, individual, caravan main organizing group, several groups, asylum seekers, families, his case, smugglers, immigration judges, particular group, caravan, sponsor, several groups, American sponsor, nonprofit group, children, Migrants, groups, protesters, his children, many migrants, group, her children, Immigrants, activists, their children, immigrants

Phrases (four clusters):
  1) Central American migrants, Central American children, several American advocacy groups, several groups, Other administration officials
  2) asylum-seekers, gay migrants, refugees, undocumented immigrants, Asylum-seekers, migrants, asylum seekers, smugglers, Migrants, Immigrants, immigrants
  3) Similar migrant groups, caravan main organizing group, several groups, groups, protesters, group, activists
  4) migrant families, families, children, his children, her children, their children, U.S. families, his family
13. Conclusion and Future work
• Reliably resolved mentions related to geo-political entities or organizations
• Clustered mentions while maintaining a fine-grained level of conceptualization
Future work
• Use as a component of cross-document coreference resolution
• Quantitative evaluation
• Context-dependent word/phrase sense disambiguation
14. References
1. Hamborg, F., Zhukova, A., Gipp, B.: Automated identification of media bias by word choice and labeling in news articles. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL) (Jun 2019)
2. Hamborg, F., Zhukova, A., Gipp, B.: Illegal aliens or undocumented immigrants? Towards the automated identification of bias by word choice and labeling. In: Proceedings of the iConference 2019 (Mar 2019)
3. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: OPTICS: Ordering points to identify the clustering structure. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data. pp. 49–60. SIGMOD ’99, Association for Computing Machinery, New York, NY, USA (1999)
4. Cambria, E., Poria, S., Hazarika, D., Kwok, K.: SenticNet 5: Discovering conceptual primitives for sentiment analysis by means of context embeddings. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
5. Cha, M., Gwon, Y., Kung, H.: Language modeling by clustering with word embeddings for text readability assessment. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. pp. 2003–2006 (2017)
6. Chen, N.C., Suh, J., Verwey, J., Ramos, G., Drucker, S., Simard, P.: AnchorViz: Facilitating classifier error discovery through interactive semantic data exploration. In: 23rd International Conference on Intelligent User Interfaces. pp. 269–280 (2018)
7. Han, X., Wu, Z., Huang, P.X., Zhang, X., Zhu, M., Li, Y., Zhao, Y., Davis, L.S.: Automatic spatially-aware fashion concept discovery. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1463–1471 (2017)
8. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J., Bethard, S.J., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Association for Computational Linguistics (ACL) System Demonstrations. pp. 55–60 (2014)
9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
10. Murtagh, F., Contreras, P.: Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1), 86–97 (2012)
11. Subramanian, S., Roth, D.: Improving generalization in coreference resolution via adversarial training. In: Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019). pp. 192–197. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019)
12. Zheng, G., Callan, J.: Learning to reweight terms with distributed representations. In: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 575–584 (2015)
15. Thank you for your attention!
Questions?