Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations

•

0 likes•3 views

Authors: Anastasia Zhukova, Felix Hamborg, Bela Gipp Publication date: 2020/08/05 Journal: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL) Abstract: Dataset exploration is a set of techniques crucial in many research and data science projects. For textual datasets, commonly used techniques include topic modeling, document summarization, and methods related to dimension reduction. Despite their robustness, these techniques suffer from at least one of the following drawbacks: document summarization does not explicitly set documents in relation, the others yield summaries or topics that often are difficult to interpret and yield poor results for topics that consist of context-dependent terms. We propose a method for dataset exploration that employs cross-document near-identity resolution of mentions of semantic concepts, such as persons, other named entity types, events, actions. The method not only sets documents in relation and thus allows for comparative dataset exploration, but also yields well interpretable document representations. Additionally, due to the underlying approach for cross-document resolution of concept mentions, the method is able to set documents in relation as to their near-identity terms, e.g., synonyms that are not universally valid but only in the given dataset. PDF: https://dl.acm.org/doi/pdf/10.1145/3383583.3398562

Science

Interpretable and Comparative Textual Dataset
Exploration Using Near-Identity Mention Relations
Anastasia Zhukova1, Felix Hamborg2, Bela Gipp1
Problem
• Text dataset exploration lacks of methods
allow text comparison and content
interpretation
• SoTA: uni-/bigram instead of phrase level
• Lack of identification of complex
coreferential relations
2. Detailed representation
On cluster level: NI-CDCR (near-identity cross-document coreference resolution)
Contact
Web: https://dke.uni-wuppertal.de/en/
Email: zhukova@uni-wuppertal.de, felix.hamborg@uni.kn
1 University of Wuppertal, 2 University of Konstanz
Research question
How to explore text datasets to ensure
comparability across texts, interpretability of
representation, and employment of the
complex semantic relations among text
components?
Features
• Histogram-matrix → comparative text
exploration
• Dimensions
o LDA-based clusters
o user-defined metadata / extracted
information
• Resolution of semantically complex and
lexical diverse concepts (NI-CDCR)
o Identity: “Cuba” = “the Caribbean Pearl”
o Near-identity: “US government” ~ “Trump
Administration” ~ “White House”
o Context identity: “resuming its nuclear
program” = “Tehran's nuclear ambitions”
• Easy-to-interpret document representation
• Three levels of analysis: cluster, detailed,
and document view
• User-controlled number of concepts
Metadata:
news
slant
/
frame
/
subtopic
0+2 5 6 7 3 1 4+9
8
RR
R
M
LL
L
0 2
Document view
Metadata: event id
Detailed view
Methodology
1. High-level representation
On article level: LDA
• cosine similarity
between LDA-topic
vectors
• small squares: articles
• 𝑠𝑖𝑚 ≥ 0.7 → related
0
1
2
3
4
5
6
7
8
9
4+9
4+9
0+2
0+2
Event ids
“Trump” entity is mentioned across all clusters
253
Cluster view

Similar to Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations

Learning Relations from Social Tagging DataHang Dong

Creating metadata for data visualizationgesinaphillips

Modeling Causal Reasoning in Complex Networks through NLP: an IntroductionLuca Nannini

NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation l...ssuser4b1f48

Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment Paris Sud University

Easing embedding learning by comprehensive transcription of heterogeneous inf...paper_reader

TopicModels_BleiPaper_Summary.pptxKalpit Desai

Network analyses of SPSSIKevin Lanning

MacroMicroZoom.pdfMartin Wynne

Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...Anastasia Zhukova

Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf

DH Tools Workshop #1: Text Analysiscjbuckner

A Bridge Not too FarValeria de Paiva

NS-CUK Joint Journal Club: V.T.Hoang, Review on "Heterogeneous Graph Attentio...ssuser4b1f48

Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWCValentina Presutti

DCLA14_Haythornthwaite_Absar_PaulinSimon Buckingham Shum

Auto Mapping Texts for Human-Machine Analysis and SensemakingShalin Hai-Jew

Semantic, Cognitive and Perceptual Computing -Human mental representationArtificial Intelligence Institute at UofSC

XCoref: Cross-document Coreference Resolution in the WildAnastasia Zhukova

Natural Language ProcessingNimrita Koul

Similar to Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations (20)

Learning Relations from Social Tagging Data

Creating metadata for data visualization

Modeling Causal Reasoning in Complex Networks through NLP: an Introduction

NS-CUK Seminar: H.B.Kim, Review on "metapath2vec: Scalable representation l...

Tutorial@BDA 2017 -- Knowledge Graph Expansion and Enrichment

Easing embedding learning by comprehensive transcription of heterogeneous inf...

TopicModels_BleiPaper_Summary.pptx

Network analyses of SPSSI

MacroMicroZoom.pdf

Interpretable Topic Modeling Using Near-Identity Cross-Document Coreference R...

Concurrent Inference of Topic Models and Distributed Vector Representations

DH Tools Workshop #1: Text Analysis

A Bridge Not too Far

NS-CUK Joint Journal Club: V.T.Hoang, Review on "Heterogeneous Graph Attentio...

Fueling the future with Semantic Web patterns - Keynote at WOP2014@ISWC

DCLA14_Haythornthwaite_Absar_Paulin

Auto Mapping Texts for Human-Machine Analysis and Sensemaking

Semantic, Cognitive and Perceptual Computing -Human mental representation

XCoref: Cross-document Coreference Resolution in the Wild

Natural Language Processing

Recently uploaded

A Scientific PowerPoint on Albert Einsteinxgamestudios8

Factor Causing low production and physiology of mamary GlandRcvets

Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure

Taphonomy and Quality of the Fossil RecordSangram Sahoo

Polyethylene and its polymerization.pptxMuhammadRazzaq31

Heads-Up Multitasker: CHI 2024 Presentation.pdfbyp19971001

Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...yogeshlabana357357

FORENSIC CHEMISTRY ARSON INVESTIGATION.pdfSuchita Rawat

NuGOweek 2024 programme final FLYER short.pdfpablovgd

Electricity and Circuits for Grade 9 studentslevieagacer

dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...dkNET

PHOTOSYNTHETIC BACTERIA (OXYGENIC AND ANOXYGENIC)kushbuR

Heat Units in plant physiology and the importance of Growing Degree daysBrahmesh Reddy B R

Information science research with large language models: between science and ...Fabiano Dalpiaz

SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptxPat (JS) Heslop-Harrison

Warming the earth and the atmosphere.pptxGlendelCaroz

X-rays from a Central “Exhaust Vent” of the Galactic Center ChimneySérgio Sacani

Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...Ansari Aashif Raza Mohd Imtiyaz

HIV AND INFULENZA VIRUS PPT HIV PPT INFULENZA VIRUS PPTABHISHEK SONI NIMT INSTITUTE OF MEDICAL AND PARAMEDCIAL SCIENCES , GOVT PG COLLEGE NOIDA

Classification of Kerogen, Perspective on palynofacies in depositional envi...Sangram Sahoo

Recently uploaded (20)

A Scientific PowerPoint on Albert Einstein

Factor Causing low production and physiology of mamary Gland

Chemistry Data Delivery from the US-EPA Center for Computational Toxicology a...

Taphonomy and Quality of the Fossil Record

Polyethylene and its polymerization.pptx

Heads-Up Multitasker: CHI 2024 Presentation.pdf

Soil and Water Conservation Engineering (SWCE) is a specialized field of stud...

FORENSIC CHEMISTRY ARSON INVESTIGATION.pdf

NuGOweek 2024 programme final FLYER short.pdf

Electricity and Circuits for Grade 9 students

dkNET Webinar: The 4DN Data Portal - Data, Resources and Tools to Help Elucid...

PHOTOSYNTHETIC BACTERIA (OXYGENIC AND ANOXYGENIC)

Heat Units in plant physiology and the importance of Growing Degree days

Information science research with large language models: between science and ...

SaffronCrocusGenomicsThessalonikiOnlineMay2024TalkOnline.pptx

Warming the earth and the atmosphere.pptx

X-rays from a Central “Exhaust Vent” of the Galactic Center Chimney

Molecular and Cellular Mechanism of Action of Hormones such as Growth Hormone...

HIV AND INFULENZA VIRUS PPT HIV PPT INFULENZA VIRUS PPT

Classification of Kerogen, Perspective on palynofacies in depositional envi...

Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations

1. Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations Anastasia Zhukova1, Felix Hamborg2, Bela Gipp1 Problem • Text dataset exploration lacks of methods allow text comparison and content interpretation • SoTA: uni-/bigram instead of phrase level • Lack of identification of complex coreferential relations 2. Detailed representation On cluster level: NI-CDCR (near-identity cross-document coreference resolution) Contact Web: https://dke.uni-wuppertal.de/en/ Email: zhukova@uni-wuppertal.de, felix.hamborg@uni.kn 1 University of Wuppertal, 2 University of Konstanz Research question How to explore text datasets to ensure comparability across texts, interpretability of representation, and employment of the complex semantic relations among text components? Features • Histogram-matrix → comparative text exploration • Dimensions o LDA-based clusters o user-defined metadata / extracted information • Resolution of semantically complex and lexical diverse concepts (NI-CDCR) o Identity: “Cuba” = “the Caribbean Pearl” o Near-identity: “US government” ~ “Trump Administration” ~ “White House” o Context identity: “resuming its nuclear program” = “Tehran's nuclear ambitions” • Easy-to-interpret document representation • Three levels of analysis: cluster, detailed, and document view • User-controlled number of concepts Metadata: news slant / frame / subtopic 0+2 5 6 7 3 1 4+9 8 RR R M LL L 0 2 Document view Metadata: event id Detailed view Methodology 1. High-level representation On article level: LDA • cosine similarity between LDA-topic vectors • small squares: articles • 𝑠𝑖𝑚 ≥ 0.7 → related 0 1 2 3 4 5 6 7 8 9 4+9 4+9 0+2 0+2 Event ids “Trump” entity is mentioned across all clusters 253 Cluster view

Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations

Recommended

Recommended

More Related Content

Similar to Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations

Similar to Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations (20)

More from Anastasia Zhukova

More from Anastasia Zhukova (9)

Recently uploaded

Recently uploaded (20)

Interpretable and Comparative Textual Dataset Exploration Using Near-Identity Mention Relations