Probabilistic indexing and archives
Possibilities and limits
Seth van Hooland,
March 23rd 2018
TNA, London
http://linkeddatafragments.org
http://course.freeyourmetadata.org/modeling/
http://www.tandfonline.com/doi/abs/10.1080/15332748.2017.1400725
Opportunity for
subject-based access
• Studies underline end users' interest in
topical searches, but :
• inter-indexer inconsistency
• cost of manual indexing
• Possibilities and limits of using automated
methods to provide subject-based access ?
Unsupervised machine learning
• Often used for exploratory data
analysis by clustering documents
in very large corpora with
unknown content
• “Distant reading” techniques
within the Digital Humanities
• Two popular methods :
• Topic Modeling (TM)
• Word Embeddings (WE)
Case study on unsupervised ML
• Combination of
• LDA
• Word2Vec
• To create automated links to
Eurovoc per document
Corpus
• 24,787 PDF documents, representing 138.3 GB
• Period 1958-1982, with documents in French,
Dutch, German, Italian, Danish, English and Greek
• Only descriptive metadata available for the fonds
creator
• Little value from a traditional archival perspective,
but as an aggregate it offers the possibility to analyse
policy development over time
1950s 1980s
Methodology ?
• How many topics do we
want ?
• How can we label topics ?
K-parameter
• A small number of topics results in overly generic
categories; a high number results in topics that
are not sufficiently representative of the corpus
• Depends on what you want :
• cover the entire corpus by making sure
every document is indexed
• or to discover specific semantics …
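One way to make the "every document is indexed" goal measurable is sketched below. It is only an illustration: the document-topic weights, the `coverage` and `mean_topic_share` helpers and the threshold values are assumptions, not figures or code from the case study.

```python
from typing import Dict, List

# Hypothetical document-topic distributions (each row sums to 1).
# In practice these would come from the fitted topic model.
DOC_TOPICS: Dict[str, List[float]] = {
    "doc1": [0.70, 0.20, 0.10],
    "doc2": [0.60, 0.30, 0.10],
    "doc3": [0.40, 0.35, 0.25],
    "doc4": [0.10, 0.10, 0.80],
}


def coverage(doc_topics: Dict[str, List[float]], threshold: float) -> float:
    """Share of documents with at least one topic at or above the threshold."""
    if not doc_topics:
        return 0.0
    indexed = sum(1 for weights in doc_topics.values() if max(weights) >= threshold)
    return indexed / len(doc_topics)


def mean_topic_share(doc_topics: Dict[str, List[float]]) -> List[float]:
    """Average weight of each topic across all documents."""
    n_docs = len(doc_topics)
    n_topics = len(next(iter(doc_topics.values())))
    totals = [0.0] * n_topics
    for weights in doc_topics.values():
        for k, weight in enumerate(weights):
            totals[k] += weight
    return [total / n_docs for total in totals]


if __name__ == "__main__":
    # Stricter thresholds leave more documents without a confident topic.
    print(coverage(DOC_TOPICS, 0.5))
    print(coverage(DOC_TOPICS, 0.75))
    print(mean_topic_share(DOC_TOPICS))
```

Running such a check for several values of K would show how corpus coverage changes as topics become more specific.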
Finding a balance
• Topic “eec regulation council commission
community decision european december
amended article” => 0.31336
• Topic “energy nuclear coal projects gas oil
community power heat fuel ” => 0.03307
Methodology ?
• How many topics do we
want ?
• How can we label topics ?
Topic labeling
• Hulpuş et al. (2013) and Allahyari and Kochut
(2015) use the graph structure of DBpedia to
rank the different label candidates
• But - topics may contain different concepts and
the graph structure of DBpedia as a knowledge
structure is not terribly coherent …
• Our approach : use pre-trained Word2Vec to
spot which terms form semantic clusters and
match those with Eurovoc
https://nlp.stanford.edu/projects/histwords/
Word2Vec
Topics as concepts
• Using Word2Vec (W2V) to detect different
concepts within one topic by making use of the
distance between terms
• For example : “labour, farm, poultry, sheep, pig,
land, family, income, holding, purchased”
• Three concepts within one topic :
• labour, farm, poultry, sheep, pig, land
• family
• income, holding, purchased
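The grouping of topic terms by distance can be sketched as follows. This is a minimal illustration, not the method used in the case study: the three-dimensional vectors are invented stand-ins for pre-trained Word2Vec embeddings, and the single-linkage rule and the 0.9 similarity threshold are assumptions.

```python
from math import sqrt
from typing import Dict, List, Sequence

# Invented toy vectors standing in for pre-trained Word2Vec embeddings.
TOY_VECTORS: Dict[str, Sequence[float]] = {
    "labour": (0.90, 0.10, 0.20),
    "farm": (1.00, 0.00, 0.10),
    "poultry": (0.95, 0.05, 0.00),
    "sheep": (0.90, 0.00, 0.05),
    "pig": (0.92, 0.02, 0.03),
    "land": (0.85, 0.10, 0.15),
    "family": (0.10, 1.00, 0.10),
    "income": (0.10, 0.10, 0.90),
    "holding": (0.20, 0.05, 0.95),
    "purchased": (0.15, 0.00, 0.85),
}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster_terms(vectors: Dict[str, Sequence[float]],
                  min_similarity: float) -> List[List[str]]:
    """Single-linkage grouping: a term joins every cluster in which at least
    one member is similar enough, merging those clusters together."""
    clusters: List[List[str]] = []
    for term, vec in vectors.items():
        merged = [term]
        remaining = []
        for cluster in clusters:
            if any(cosine(vec, vectors[other]) >= min_similarity for other in cluster):
                merged = cluster + merged
            else:
                remaining.append(cluster)
        clusters = remaining + [merged]
    return clusters


if __name__ == "__main__":
    for group in cluster_terms(TOY_VECTORS, 0.9):
        print(", ".join(group))
```

With real embeddings the threshold would need tuning, since a value that is too low merges unrelated concepts and one that is too high splits every term into its own cluster.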
Reconciliation
• In order to perform the matching with
Eurovoc, we are testing two options :
• Either focus on the most central (“centroid”) term
of a concept and see how many match
• Or use the structure of Eurovoc for decision
making (e.g. pick the term on the deepest
level or the one with the most
non-descriptors attached to it)
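Both reconciliation options can be sketched in a few lines. Everything below is a hedged illustration: the vectors are invented, the `Candidate` records with their depth and non-descriptor counts are hypothetical rather than taken from Eurovoc, and using the non-descriptor count as a tie-breaker after depth is an assumption about how the two criteria might be combined.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid_term(cluster: Dict[str, Sequence[float]]) -> str:
    """Return the term whose vector is closest to the mean vector of the cluster."""
    dims = len(next(iter(cluster.values())))
    mean = [sum(vec[i] for vec in cluster.values()) / len(cluster) for i in range(dims)]
    return max(cluster, key=lambda term: cosine(cluster[term], mean))


@dataclass
class Candidate:
    """Hypothetical thesaurus candidate; the values are not real Eurovoc data."""
    label: str
    depth: int            # level in the thesaurus hierarchy
    non_descriptors: int  # number of non-preferred terms attached


def pick_by_structure(candidates: List[Candidate]) -> Candidate:
    """Prefer the deepest candidate; break ties on the non-descriptor count."""
    return max(candidates, key=lambda c: (c.depth, c.non_descriptors))


# Invented toy vectors for one concept cluster.
CLUSTER = {
    "income": (0.10, 0.10, 0.90),
    "holding": (0.20, 0.05, 0.95),
    "purchased": (0.15, 0.00, 0.85),
}

# Hypothetical candidates with made-up depth and non-descriptor counts.
CANDIDATES = [
    Candidate("income", depth=2, non_descriptors=5),
    Candidate("agricultural holding", depth=3, non_descriptors=2),
    Candidate("purchase", depth=3, non_descriptors=1),
]

if __name__ == "__main__":
    print(centroid_term(CLUSTER))
    print(pick_by_structure(CANDIDATES).label)
```

The first option depends only on the embedding space, while the second depends only on the thesaurus, so the two can disagree and may be worth comparing on a sample of topics.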
Image sources
• https://www.flickr.com/photos/c_41/8570463977/
• https://www.flickr.com/photos/dearalice/3662244759/
• https://www.flickr.com/photos/48143042@N05/4418727909/
• https://www.flickr.com/photos/mswid/4153808015/
• https://www.flickr.com/photos/21161327@N04/38524477854
• https://www.flickr.com/photos/jonnygwilliams/14118085866/
• https://www.flickr.com/photos/silverstealth/2389690278/
