On the Quest for Changing Knowledge. Capturing emerging entities from social media. WebScience 2016 DDI
Massive data integration technologies have recently been used to produce very large ontologies. However, knowledge in the world continuously evolves, and ontologies remain largely incomplete with respect to low-frequency data, belonging to the so-called long tail.
Socially produced content is an excellent source for discovering emerging knowledge: it is huge, and immediately
reflects the relevant changes which hide emerging entities. Thus, we propose a method for
discovering emerging entities by extracting them from social content.
We start from a purely syntactic method as a baseline, and we propose a semantic method based on entity recognition and DBpedia. The method associates candidate entities with feature vectors, built
from social content by using co-occurrence, and then extracts the emerging entities by using feature similarity measures.
Once set up by experts through a very simple initialization, the method is capable of finding emerging
entities and extracting their relevant relationships to given types; the method can be
iterated continuously or periodically, using the already identified emerged knowledge as a new starting point.
We validate our method by applying it to a set of diverse domain-specific application scenarios, spanning fashion, literature, and exhibitions, among others. We show the approach at work and demonstrate its effectiveness on datasets with different characteristics in terms of coverage, dynamics, and size.
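The core idea of the abstract (co-occurrence feature vectors compared through a similarity measure) can be illustrated with a minimal sketch. It is not the method evaluated in the talk: it assumes plain bag-of-words co-occurrence counts, cosine similarity, and a hypothetical list of seed entities standing in for the expert initialization, and it leaves out entity recognition and DBpedia entirely. All function names are illustrative.

```python
import math
from collections import Counter


def cooccurrence_vector(entity, posts):
    """Count the terms that appear in the same post as `entity`."""
    vec = Counter()
    for tokens in posts:
        if entity in tokens:
            vec.update(t for t in tokens if t != entity)
    return vec


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(seeds, candidates, posts):
    """Rank candidates by similarity to the summed vector of the seeds.

    `seeds` is a hypothetical stand-in for the expert-provided starting
    entities; summing their vectors is an assumption of this sketch.
    """
    centroid = Counter()
    for seed in seeds:
        centroid.update(cooccurrence_vector(seed, posts))
    scored = [
        (cosine_similarity(cooccurrence_vector(c, posts), centroid), c)
        for c in candidates
    ]
    # Highest similarity first; ties broken alphabetically.
    return [c for _, c in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy, made-up posts, already tokenized.
posts = [
    ["armani", "runway", "milan"],
    ["prada", "runway", "milan"],
    ["newlabel", "runway", "milan"],
    ["pizza", "dinner", "milan"],
]
ranking = rank_candidates(["armani", "prada"], ["newlabel", "pizza"], posts)
```

In this toy corpus the candidate that shares the seeds' context ("runway", "milan") ranks above the one that only shares the city name. Feeding the top-ranked candidates back in as seeds would mirror the iterative use described in the abstract.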
Initial pruning of candidates based on
TF-DF := tf * df / (N - df + 1)
(*) A variant of TF-IDF that does not discount document frequency, because here
frequent appearance is desirable
(we are not looking for information entropy!)
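A minimal sketch of the TF-DF pruning step follows. The slide does not define its symbols, so the sketch assumes that tf is the total number of occurrences of a term across the corpus, df is the number of posts containing the term, and N is the number of posts. The function names and the `keep` cutoff are hypothetical.

```python
from collections import Counter


def tf_df_scores(documents):
    """Score each term as tf * df / (N - df + 1).

    Assumed definitions: tf = total occurrences across all documents,
    df = number of documents containing the term, N = number of documents.
    """
    n_docs = len(documents)
    tf = Counter()
    df = Counter()
    for tokens in documents:
        tf.update(tokens)       # every occurrence counts toward tf
        df.update(set(tokens))  # each document counts once toward df
    return {
        term: tf[term] * df[term] / (n_docs - df[term] + 1)
        for term in tf
    }


def prune_candidates(documents, keep):
    """Keep the `keep` highest-scoring terms (ties broken alphabetically)."""
    scores = tf_df_scores(documents)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:keep]


# Toy, made-up posts, already tokenized.
posts = [
    ["armani", "show", "milan"],
    ["armani", "runway", "milan"],
    ["armani", "show"],
    ["gucci", "milan"],
]
scores = tf_df_scores(posts)
top_terms = prune_candidates(posts, keep=2)
```

Unlike TF-IDF, the score grows with document frequency: here "armani" (tf = 3, df = 3, N = 4) scores 3 * 3 / (4 - 3 + 1) = 4.5, while "gucci" (tf = 1, df = 1) scores 1 / 4 = 0.25.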
4,400 strategies evaluated
44 alternative feature vectors
(12 basic features and 32 aggregations)
9 different weighting values for aggregations
5 levels of recall for entity extraction
3 different distances
From 4,400 down to 10 strategies
Eliminating the less relevant parameters