15. Opportunity for
subject-based access
• Studies underline end users' interest in
topical searches, but:
• inter-indexing inconsistency
• cost of manual indexing
• What are the possibilities and limits of using
automated methods to provide subject-based access?
16. Unsupervised machine learning
• Often used for exploratory data
analysis by clustering documents
in very large corpora with
unknown content
• “Distant reading” techniques
within the Digital Humanities
• Two popular methods:
• Topic Modeling (TM)
• Word Embeddings (WE)
17. Case study on unsupervised ML
• Combination of
• LDA
• Word2Vec
• To create automated links to
Eurovoc per document
18. Corpus
• 24,787 PDF documents, representing 138.3 GB
• Period 1958-1982, with documents in French,
Dutch, German, Italian, Danish, English and Greek
• Only descriptive metadata available for the fonds
creator
• Little value from a traditional archival perspective,
but as an aggregate it offers the possibility to analyse
policy development through time
24. K-parameter
• A small number of topics results in overly generic
categories; a high number results in topics that
are not sufficiently representative of the corpus
• The choice depends on what you want:
• cover the entire corpus by making sure
every document is indexed
• or discover specific semantics …
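The trade-off above can be made concrete with a small sketch. It assumes that each document comes with a list of topic weights (as produced by any LDA implementation) and that a document counts as "indexed" when its strongest topic reaches a threshold. The `coverage` function, the threshold of 0.5 and the toy weights are illustrative assumptions, not values from the case study.

```python
def coverage(doc_topics, threshold=0.5):
    """Share of documents whose strongest topic weight is at least `threshold`."""
    indexed = sum(1 for weights in doc_topics if max(weights) >= threshold)
    return indexed / len(doc_topics)


# Invented document-topic weights for the same four documents,
# once with K=2 topics and once with K=4 topics.
SMALL_K = [[0.7, 0.3], [0.6, 0.4], [0.9, 0.1], [0.55, 0.45]]
LARGE_K = [
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.3, 0.2, 0.2],
    [0.8, 0.1, 0.05, 0.05],
    [0.4, 0.3, 0.2, 0.1],
]

print("coverage with K=2:", coverage(SMALL_K))
print("coverage with K=4:", coverage(LARGE_K))
```

In this toy data the smaller K indexes every document, while the larger K leaves some documents without a dominant topic. Real corpora may behave differently, so the threshold would need tuning.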
25. Finding a balance
• Topic “eec regulation council commission
community decision european december
amended article” => 0.31336
• Topic “energy nuclear coal projects gas oil
community power heat fuel” => 0.03307
29. Topic labeling
• Hulpus et al. (2013) and Allahyari and Kochut
(2015) use the graph structure of DBpedia to
rank the different label candidates
• But topics may contain different concepts, and
the graph structure of DBpedia as a knowledge
structure is not very coherent …
• Our approach: use pre-trained Word2Vec to
spot which terms form semantic clusters and
match those with Eurovoc
31. Topics as concepts
• Use W2V to detect different concepts
within one topic, based on the
distance between terms
• For example: “labour, farm, poultry, sheep, pig,
land, family, income, holding, purchased”
• Three concepts within one topic:
• labour, farm, poultry, sheep, pig, land
• family
• income, holding, purchased
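A minimal sketch of how such a split could be computed, assuming each topic term has an embedding vector. The three-dimensional `TOY_VECTORS` are invented stand-ins for pre-trained Word2Vec vectors, and the single-link grouping with a cosine threshold of 0.8 is one possible choice, not necessarily the method used in the case study.

```python
import math

# Invented 3-d vectors standing in for pre-trained Word2Vec embeddings.
TOY_VECTORS = {
    "labour": (0.9, 0.1, 0.2), "farm": (1.0, 0.0, 0.1),
    "poultry": (0.9, 0.0, 0.0), "sheep": (0.95, 0.05, 0.0),
    "pig": (0.9, 0.1, 0.0), "land": (0.8, 0.1, 0.2),
    "family": (0.1, 1.0, 0.1),
    "income": (0.1, 0.1, 0.9), "holding": (0.2, 0.0, 0.9),
    "purchased": (0.0, 0.1, 1.0),
}

TOPIC = ["labour", "farm", "poultry", "sheep", "pig",
         "land", "family", "income", "holding", "purchased"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cluster_terms(terms, vectors, threshold=0.8):
    """Single-link grouping: terms end up in the same cluster when they are
    connected by a chain of pairs with cosine similarity >= threshold."""
    clusters = []
    for term in terms:
        matches = [
            c for c in clusters
            if any(cosine(vectors[term], vectors[o]) >= threshold for o in c)
        ]
        if not matches:
            clusters.append([term])
            continue
        # Add the term to the first matching cluster and merge the others in.
        merged = matches[0]
        merged.append(term)
        for other in matches[1:]:
            merged.extend(other)
            clusters.remove(other)
    return clusters


for concept in cluster_terms(TOPIC, TOY_VECTORS):
    print(concept)
```

With real embeddings the vectors would come from a pre-trained model, and the threshold would have to be chosen empirically.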
32. Reconciliation
• To perform the matching with
Eurovoc, we are testing two options:
• Focus on the most central (“centroid”) term
of a concept and see how many match
• Use the structure of Eurovoc for decision
making (e.g. pick the term on the deepest
level or the one with the most
non-descriptors attached to it)
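Both options can be sketched in a few lines. The sketch assumes that the "centroid" term is the one closest (by cosine similarity) to the mean vector of its concept, and that each thesaurus candidate carries a hierarchy depth and a count of non-descriptors. `TOY_VECTORS`, `Candidate`, `CANDIDATES` and their values are invented for illustration and are not taken from Eurovoc.

```python
import math
from dataclasses import dataclass

# Invented vectors standing in for pre-trained Word2Vec embeddings.
TOY_VECTORS = {
    "income": (0.1, 0.1, 0.9),
    "holding": (0.2, 0.0, 0.9),
    "purchased": (0.0, 0.1, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def centroid_term(terms, vectors):
    """Return the term whose vector is most similar to the concept's mean vector."""
    dims = len(vectors[terms[0]])
    mean = [sum(vectors[t][i] for t in terms) / len(terms) for i in range(dims)]
    return max(terms, key=lambda t: cosine(vectors[t], mean))


@dataclass
class Candidate:
    label: str            # preferred label of a thesaurus descriptor
    depth: int            # level in the hierarchy (higher = more specific)
    non_descriptors: int  # number of non-preferred terms attached


def pick_by_structure(candidates, criterion="depth"):
    """Pick the candidate with the highest value for `criterion`
    ("depth" or "non_descriptors")."""
    return max(candidates, key=lambda c: getattr(c, criterion))


# Hypothetical candidates returned for one concept (values are made up).
CANDIDATES = [
    Candidate("income", depth=1, non_descriptors=4),
    Candidate("farm income", depth=3, non_descriptors=1),
    Candidate("household income", depth=2, non_descriptors=2),
]

print(centroid_term(["income", "holding", "purchased"], TOY_VECTORS))
print(pick_by_structure(CANDIDATES).label)
print(pick_by_structure(CANDIDATES, "non_descriptors").label)
```

The two structural criteria can disagree, as they do in this toy data, so which one works better would have to be evaluated against manually indexed documents.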