Design of a Helpdesk System for Archival
Infrastructures – An Information Retrieval approach
Kepa J. Rodriguez (SUB-Göttingen)
Talk at the Göttingen Center for Digital Humanities (GCDH)
CONNECTING COLLECTIONS
21/05/2013
Outline

The EHRI project

Motivation

Analysis of user requests

Representation of archival institutions in the system

Relevance of an institution for answering a query

Experiments

A first prototype: presentation and demo

Further work
The EHRI project

EHRI – European Holocaust Research Infrastructure –
aims to build a virtual research environment for
Holocaust research

20 institutions from 13 countries are involved in the
development

EHRI aims to integrate collection descriptions of more
than 1,000 archives, museums, memorials and other
institutions

More information: http://www.ehri-project.eu
Why a helpdesk?

Often historians get useful information directly from
archivists and not from catalogs and finding aids.

In Shoah research the material is widely dispersed: before
contacting archivists, the researcher needs to identify
the right institution.

One of the goals of EHRI and other European projects is
to put researchers in contact with collection-holding
institutions.

Partial solution: EHRI will provide an integrated help
desk to support users in the identification of the
institution.
What is a helpdesk?

In the automatic helpdesk:

The user writes the problem as an e-mail,

She/he sends it to the helpdesk

The system analyzes the problem and proposes
suitable archival institutions.
Analysis of user queries: user requests
Data set: 400 e-mails sent to NIOD Amsterdam
(Institute for War, Holocaust and Genocide Studies)
The most frequent user queries are:

Information requests about individual persons or
families.

Information requests about concentration camps,
ghettos, KZ and detention facilities. Forced labor.

Information requests about documents.

Information requests about places.

Information requests about press, radios and
underground press.
Analysis of user queries: provided information (1)
Data about persons

Name, family name, initials

Biographic data: profession, position in companies, role
in the army, resistance, antifascist and fascist
organizations.

Membership in political organizations, in resistance, etc.

Dates of birth, travel, arrest, deportation, execution,
death, exile.

Places of birth, residence, travel, arrest, deportation,
execution, death, exile.

Citizenship
Analysis of user queries: provided information (2)
Data about facilities and transports

Place of KZ, detention facilities, ghettos.

Route of transports, stations.

Dates of transports.
Historical events:

Described by groups of words.

Often defined by verbs or nominalizations:
uprising, assassination, bombardment, etc.
Proposed solution
The help desk:

Knows all necessary things about institutions.

How are institutions represented?

Understands the needs of the user.

How is the query analyzed and represented?

Gives feedback to the user about relevant
institutions.

How is relevance computed?
How can we define an institution for our purposes?

An institution is defined by:

Its collections, files, documents.

Its controlled vocabularies.

Its profile.
... and how do we represent it? A short formal introduction
First of all, an introduction to vector spaces
Example: two-dimensional vector space
We use a larger space: 71,000 dimensions or more
Representation of institutions in vector space

Collections, file descriptions, controlled vocabularies
are sets of words.

Features of the space are (selected) word lemmas

Features are extracted using natural language
processing

Texts are tokenized and words identified

Each word is annotated with a part-of-speech (POS) tag

Each word is annotated with a lemma

Words with a selected list of POS tags are
candidates for being a feature
Representation of institutions in vector space (1)
Example of NLP processed text
......
including VVG include
circulars NNS circular
from IN from
the DT the
Central NP Central
Refugee NP Refugee
Committee NP Committee
on IN on
the DT the
enlistment NN enlistment
of IN of
tradesmen NNS tradesman
into IN into
the DT the
army NN army
......
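The selection step described above can be sketched in a few lines. This is a minimal illustration, not the prototype's code: it assumes tagger output with one token per line in the order token, POS tag, lemma (as in the sample), and it assumes that the selected tags are those starting with NN, NP or VV. The function name, the tag prefixes and the lowercasing of lemmas are choices made for this example.

```python
# POS tag prefixes kept as features. Assumption: tags as in the sample
# above (NN/NNS nouns, NP proper nouns, VV* lexical verbs).
SELECTED_POS_PREFIXES = ("NN", "NP", "VV")


def extract_features(tagged_text):
    """Return the lemmas of all tokens whose POS tag is selected.

    Each non-empty input line is expected to hold: token, POS tag, lemma.
    """
    lemmas = []
    for line in tagged_text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        _token, pos, lemma = parts
        if pos.startswith(SELECTED_POS_PREFIXES):
            # Lowercasing is an assumption made for this sketch.
            lemmas.append(lemma.lower())
    return lemmas


sample = """including VVG include
circulars NNS circular
from IN from
the DT the
Central NP Central
Refugee NP Refugee
Committee NP Committee
on IN on
the DT the
enlistment NN enlistment
of IN of
tradesmen NNS tradesman
into IN into
the DT the
army NN army"""

features = extract_features(sample)
```

Prepositions and determiners (IN, DT) are dropped, so only content-bearing lemmas remain as candidate features.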
Representation of institutions in vector space (2)

Each feature is associated with a value/weight

Two functions used for the experiments

TF-IDF: Term frequency / Inverse document
frequency.

Okapi-BM25.

User query: the weight is the term frequency
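The slides name Okapi BM25 without giving its parameters. As a rough sketch under stated assumptions, the weight of a lemma in a document can be computed as below. The function name, the input layout (a dict from document name to lemma list), the parameter values k1 = 1.2 and b = 0.75 and the smoothed IDF variant are all assumptions of this example, not values taken from the talk. Documents are assumed to be non-empty.

```python
import math
from collections import Counter


def bm25_weights(documents, k1=1.2, b=0.75):
    """Compute one BM25 weight per (document, lemma) pair.

    documents: dict mapping a document name to its list of lemmas.
    k1 and b are common default values, assumed for this sketch.
    """
    n_docs = len(documents)
    avg_len = sum(len(lemmas) for lemmas in documents.values()) / n_docs

    # Document frequency: in how many documents each lemma occurs.
    doc_freq = Counter()
    for lemmas in documents.values():
        doc_freq.update(set(lemmas))

    weights = {}
    for name, lemmas in documents.items():
        counts = Counter(lemmas)
        # Length normalisation: long documents are penalised.
        length_norm = k1 * (1 - b + b * len(lemmas) / avg_len)
        weights[name] = {}
        for lemma, tf in counts.items():
            df = doc_freq[lemma]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            weights[name][lemma] = idf * tf * (k1 + 1) / (tf + length_norm)
    return weights


# Hypothetical toy data, for illustration only.
toy_documents = {
    "inst_a": ["press", "press", "newspaper"],
    "inst_b": ["army", "soldier", "press"],
}
toy_weights = bm25_weights(toy_documents)
```

In the toy data, "newspaper" occurs in only one document and therefore gets a higher weight than "press", which occurs in both.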
Representation of institutions in vector space (3)
Example of function: TF-IDF
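The formula shown on this slide is not preserved in the transcript. A common formulation, assumed here, is weight(t, d) = tf(t, d) * log(N / df(t)), where tf is the relative frequency of lemma t in document d, N the number of documents and df(t) the number of documents containing t. The sketch below follows that assumption. The function names and the input layout are illustrative, and documents are assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Build one sparse TF-IDF vector (dict lemma -> weight) per document.

    documents: dict mapping a document name to its list of lemmas.
    """
    n_docs = len(documents)

    # Document frequency: in how many documents each lemma occurs.
    doc_freq = Counter()
    for lemmas in documents.values():
        doc_freq.update(set(lemmas))

    vectors = {}
    for name, lemmas in documents.items():
        counts = Counter(lemmas)
        total = len(lemmas)
        vectors[name] = {
            lemma: (count / total) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in counts.items()
        }
    return vectors


def query_vector(lemmas):
    """For the user query the weight is the plain term frequency."""
    return dict(Counter(lemmas))
```

With this formulation a lemma that occurs in every document gets weight 0, so it does not help to distinguish institutions.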
Representation of institutions in vector space (4)
Example of vector (fragment)
......
activist 0.07296286111157319
answer 0.14467748398012975
welfare 0.04822582799337658
job 0.04822582799337658
number 0.04822582799337658
regard 0.5107400277810122
emigration 0.07296286111157319
create 0.04822582799337658
include 0.2553700138905061
soldier 0.04822582799337658
editor 0.04822582799337658
ownership 0.07296286111157319
poster 0.04822582799337658
branch 0.04822582799337658
disclosure 0.04822582799337658
underground 0.07296286111157319
newspaper 0.43403245194038925
press 0.04822582799337658
......
Relevance: how can it be measured?
We can measure relevance as:

Angle between the vectors (e.g. cosine similarity)

Distance between the end-points of the vectors

Etc.
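As an illustration of these measures, the following sketch computes the scalar product, cosine similarity and Euclidean distance for sparse vectors. Representing a vector as a dict from feature to weight is an assumption made here for readability; the function names are likewise illustrative.

```python
import math


def dot(u, v):
    """Scalar product of two sparse vectors (dict feature -> weight)."""
    return sum(weight * v.get(feature, 0.0) for feature, weight in u.items())


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is empty."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot(u, v) / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Distance between the end-points of u and v."""
    features = set(u) | set(v)
    return math.sqrt(sum((u.get(f, 0.0) - v.get(f, 0.0)) ** 2 for f in features))
```

Cosine similarity ignores vector length, which matters here because a short e-mail is compared with much longer collection descriptions. Euclidean distance does not have this property.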
Experiments: task

Collection descriptions were used to build a
vector space model.

Institutions provided us with e-mails containing
information requests

The institution was able to give the requested
information.

The information was related to their
collections, not to administrative issues.

Goal: find out to which institution an e-mail was
sent
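The task can be sketched as a ranking problem: score the e-mail vector against every institution vector and take the best-scoring institution as the prediction. The code below is a minimal sketch under that assumption. The institution names, the toy weights and the use of cosine similarity for scoring are illustrative and not taken from the experiments.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors (dict feature -> weight)."""
    shared = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return shared / (norm_u * norm_v)


def rank_institutions(query, institution_vectors):
    """Return (institution, score) pairs, best match first."""
    scored = [(name, cosine(query, vec)) for name, vec in institution_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors, for illustration only.
institution_vectors = {
    "institution_a": {"newspaper": 0.43, "press": 0.05},
    "institution_b": {"transport": 0.5, "deportation": 0.3},
}
query = {"newspaper": 1, "underground": 1}

ranking = rank_institutions(query, institution_vectors)
predicted = ranking[0][0]
```

Keeping the full ranking rather than only the top result is useful when more than one institution can answer a request.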
Experiments: data sets

Learning set: collection and file descriptions of

Wiener Library: 553 documents / 84,918
words

NIOD: 1929 documents / 2,095,402 words
(English translation)

Yad Vashem: 17,086 documents / 1,491,858
words

Jewish Museum Prague: 1 document / 10,825
words

Test set: e-mails sent to:

NIOD: 15 e-mails / 2759 words

Yad Vashem: 21 e-mails / 3900 words

Wiener Library: 30 e-mails / 2342 words
Experiments: Functions and metrics

Features ranked using

TF-IDF

Okapi-BM25

Similarity computed using

scalar product

cosine similarity

Euclidean distance

Evaluation: Precision, Recall, F1
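For reference, precision, recall and F1 for one institution can be computed from the gold and predicted labels as sketched below. How the scores were averaged over institutions in the experiments is not stated in the slides, so this example only shows the per-institution computation; the function name and the labels are illustrative.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 for one institution label.

    gold: institutions the e-mails were actually sent to.
    predicted: institutions proposed by the system, in the same order.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels, for illustration only.
gold = ["A", "A", "B", "B"]
predicted = ["A", "B", "B", "B"]
scores_a = precision_recall_f1(gold, predicted, "A")
```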
Experiments: evaluation
Experiments: result
Using cosine similarity we obtain:
F1 ≈ 0.7 to 0.8
Several cases are not real errors:

More than one institution is able to answer the
question
Difficulties

Often more than one institution is able to answer a user
request

That is good for the system, but makes the evaluation
difficult

With more data we can do an evaluation based on
relevance ranking.

Very unspecific requests can be answered by
almost all institutions.

Example: “I am looking for transportations list from
Bratislava to Auschwitz, happens on march 1942
witch concern my mother. May I ask you for help?”
Difficulties

Multilinguality

Collection and file descriptions are in different
languages

Machine translation is necessary

For the experiments, this was done by hand with
Google

A language identification module will be necessary.

We will integrate collection descriptions, but much
interesting information lies at levels not accessible to EHRI

File-level descriptions

Documents
Prototype

Implementation:

Web application written in Java

Maven, Spring framework

NLP based feature extraction

Tokenizer: self-developed

POS-tagger and lemmatizer

TreeTagger
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

Extracted lemmas of words with POS Noun, Proper noun, Verb

Ranking function: TF-IDF

Similarity: Cosine
Next steps (1)

Explore use of other sources of knowledge

Authority files: names of persons, places...

Institution descriptions

Based on the data, implement lists and dictionaries
for:

Synonymy:

Nazi, nationalsozialist and national-sozialist
are treated as different features

Alternative spellings and misspellings

Text pre- and post- processing
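One simple way to handle synonyms and spelling variants is a lookup table that maps each variant to a single canonical feature before weighting. The sketch below assumes such a table. Its entries and the choice of "nazi" as the canonical form are hypothetical; the planned dictionaries would be derived from the data.

```python
# Hypothetical variant table: variant spelling -> canonical feature.
VARIANTS = {
    "nationalsozialist": "nazi",
    "national-sozialist": "nazi",
}


def normalize(lemmas, variants=VARIANTS):
    """Lowercase each lemma and replace known variants by one feature."""
    normalized = []
    for lemma in lemmas:
        key = lemma.lower()
        normalized.append(variants.get(key, key))
    return normalized


merged = normalize(["Nazi", "nationalsozialist", "national-sozialist", "army"])
```

After normalization the three spellings count towards the same dimension of the vector space instead of three separate ones.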
Next steps (2)

Multilinguality:

Use APIs for automatic translation

Implement a language identification module

Use the ranking of relevance

For evaluation

To present results to the user

Find a threshold to define when results are
irrelevant and the user needs to give more details
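A threshold of this kind could be applied to the ranked results as sketched below. The threshold value 0.1 and the function names are placeholders for this example; a suitable value would have to be found experimentally.

```python
def filter_relevant(ranked, threshold=0.1):
    """Keep only (institution, score) pairs that reach the threshold."""
    return [(name, score) for name, score in ranked if score >= threshold]


def needs_more_details(ranked, threshold=0.1):
    """True if no institution is relevant enough to be proposed."""
    return not filter_relevant(ranked, threshold)


# Hypothetical ranking, for illustration only.
ranked = [("institution_a", 0.70), ("institution_b", 0.02)]
proposed = filter_relevant(ranked)
```

When `needs_more_details` returns True, the helpdesk would ask the user to describe the request more precisely instead of proposing an institution.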
NIOD Institute for War, Holocaust and
Genocide Studies (NL)
 
CEGES-SOMA Centre for Historical Research
and Documentation on War and Contemporary
Society (BE)
 
Jewish Museum in Prague (CZ)
 
Institute of Contemporary History Munich – Berlin
(DE)
 
YAD VASHEM The Holocaust Martyrs’ and
Heroes’ Remembrance Authority (IL)
 
The Wiener Library – Institute of Contemporary
History (UK)
 
Holocaust Memorial Center (HU)
 
HL-senteret Center for Studies of Holocaust
and Religious Minorities (NO)
 
NAF National Archives of Finland (FI)
 
The Emanuel Ringelblum Jewish Historical Institute (PL)
King’s College London (UK)
 
Georg-August-Universität Göttingen – Göttingen
State and University Library (DE)
 
Athena RC/IMIS (GR)
 
DANS Data Archiving and Networked Services (NL)
 
Shoah Memorial, Museum, Center for Contemporary
Jewish Documentation (FR)
 
ITS International Tracing Service (DE)
 
Memorial to the Murdered Jews of Europe (DE)
 
Terezín Memorial (CZ)
 
Beit Theresienstadt (IL)
 
VWI Vienna Wiesenthal Institute for
Holocaust Studies (AT)
More Related Content

Viewers also liked

Help Desk Ticketing System - Requirements Specification
Help Desk Ticketing System - Requirements SpecificationHelp Desk Ticketing System - Requirements Specification
Help Desk Ticketing System - Requirements SpecificationKyle Thompson
 
Writing Chapters 1, 2, 3 of the Capstone Project Proposal Manuscript
Writing Chapters 1, 2, 3 of the Capstone Project Proposal ManuscriptWriting Chapters 1, 2, 3 of the Capstone Project Proposal Manuscript
Writing Chapters 1, 2, 3 of the Capstone Project Proposal ManuscriptSheryl Satorre
 
Review of related literature
Review of related literatureReview of related literature
Review of related literatureBean Malicse
 
Chapter 2-Realated literature and Studies
Chapter 2-Realated literature and StudiesChapter 2-Realated literature and Studies
Chapter 2-Realated literature and StudiesMercy Daracan
 
Thesis in IT Online Grade Encoding and Inquiry System via SMS Technology
Thesis in IT Online Grade Encoding and Inquiry System via SMS TechnologyThesis in IT Online Grade Encoding and Inquiry System via SMS Technology
Thesis in IT Online Grade Encoding and Inquiry System via SMS TechnologyBelLa Bhe
 

Viewers also liked (9)

thesis in English
thesis in Englishthesis in English
thesis in English
 
Help Desk Ticketing System - Requirements Specification
Help Desk Ticketing System - Requirements SpecificationHelp Desk Ticketing System - Requirements Specification
Help Desk Ticketing System - Requirements Specification
 
Writing Chapters 1, 2, 3 of the Capstone Project Proposal Manuscript
Writing Chapters 1, 2, 3 of the Capstone Project Proposal ManuscriptWriting Chapters 1, 2, 3 of the Capstone Project Proposal Manuscript
Writing Chapters 1, 2, 3 of the Capstone Project Proposal Manuscript
 
Review of related literature
Review of related literatureReview of related literature
Review of related literature
 
Thesis elaine
Thesis elaineThesis elaine
Thesis elaine
 
Chapter 2-Realated literature and Studies
Chapter 2-Realated literature and StudiesChapter 2-Realated literature and Studies
Chapter 2-Realated literature and Studies
 
Final na final thesis
Final na final thesisFinal na final thesis
Final na final thesis
 
My thesis proposal
My thesis proposalMy thesis proposal
My thesis proposal
 
Thesis in IT Online Grade Encoding and Inquiry System via SMS Technology
Thesis in IT Online Grade Encoding and Inquiry System via SMS TechnologyThesis in IT Online Grade Encoding and Inquiry System via SMS Technology
Thesis in IT Online Grade Encoding and Inquiry System via SMS Technology
 

Similar to Design and prototype of a Help Desk System for EHRI: an Information Retrieval approach

oerworldmap adolescence of a community platform
oerworldmap adolescence of a community platformoerworldmap adolescence of a community platform
oerworldmap adolescence of a community platformJan Neumann
 
OER World Map: Adolescence of a Community Platform
OER World Map: Adolescence of a Community PlatformOER World Map: Adolescence of a Community Platform
OER World Map: Adolescence of a Community PlatformJan Neumann
 
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian WorldA Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian WorldEwan Oglethorpe
 
HistoGraph presentation Insa de Lyon
HistoGraph presentation Insa de LyonHistoGraph presentation Insa de Lyon
HistoGraph presentation Insa de Lyondhlab
 
histoGraph for historians
histoGraph for historianshistoGraph for historians
histoGraph for historiansCUbRIK Project
 
The OER World Map: Adolescence of a Community Platform for the OER Movement
The OER World Map: Adolescence of a Community Platform for the OER MovementThe OER World Map: Adolescence of a Community Platform for the OER Movement
The OER World Map: Adolescence of a Community Platform for the OER MovementOpen Education Consortium
 
Topic models, vector semantics and applications
Topic models, vector semantics and applicationsTopic models, vector semantics and applications
Topic models, vector semantics and applicationsVasileios Lampos
 
PARTHENOS Training - Macro Level Issues
PARTHENOS Training - Macro Level IssuesPARTHENOS Training - Macro Level Issues
PARTHENOS Training - Macro Level IssuesParthenos
 
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the CrowdMediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the CrowdRaphael Troncy
 
Building an Open Operations Room for the OER Community #opened16
Building an Open Operations Room for the OER Community #opened16Building an Open Operations Room for the OER Community #opened16
Building an Open Operations Room for the OER Community #opened16Robert Farrow
 
OntoSOC: S ociocultural K nowledge O ntology
OntoSOC:  S ociocultural  K nowledge  O ntology OntoSOC:  S ociocultural  K nowledge  O ntology
OntoSOC: S ociocultural K nowledge O ntology IJwest
 
Linked Open Data and Applications
Linked Open Data and Applications Linked Open Data and Applications
Linked Open Data and Applications Victor de Boer
 
cycLearning: Cyc and Currious Cat as learning companion
cycLearning: Cyc and Currious Cat as learning companioncycLearning: Cyc and Currious Cat as learning companion
cycLearning: Cyc and Currious Cat as learning companionThe Open Education Consortium
 
Sdi, communities and social media
Sdi, communities and social mediaSdi, communities and social media
Sdi, communities and social mediaWirelessInfo
 
Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...
Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...
Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...Europeana
 
Introductory Lecture Information Systems 2011.12
Introductory Lecture Information Systems 2011.12Introductory Lecture Information Systems 2011.12
Introductory Lecture Information Systems 2011.12Dr Mariann Hardey
 
Humanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanitiesHumanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanitiesdhlab
 

Similar to Design and prototype of a Help Desk System for EHRI: an Information Retrieval approach (20)

oerworldmap adolescence of a community platform
oerworldmap adolescence of a community platformoerworldmap adolescence of a community platform
oerworldmap adolescence of a community platform
 
OER World Map: Adolescence of a Community Platform
OER World Map: Adolescence of a Community PlatformOER World Map: Adolescence of a Community Platform
OER World Map: Adolescence of a Community Platform
 
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian WorldA Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
A Paradox but a Possibility: Modern Data Technologies in the Humanitarian World
 
HistoGraph presentation Insa de Lyon
HistoGraph presentation Insa de LyonHistoGraph presentation Insa de Lyon
HistoGraph presentation Insa de Lyon
 
Building a Citizen Engaged Research Project
Building a Citizen Engaged Research ProjectBuilding a Citizen Engaged Research Project
Building a Citizen Engaged Research Project
 
histoGraph for historians
histoGraph for historianshistoGraph for historians
histoGraph for historians
 
The OER World Map: Adolescence of a Community Platform for the OER Movement
The OER World Map: Adolescence of a Community Platform for the OER MovementThe OER World Map: Adolescence of a Community Platform for the OER Movement
The OER World Map: Adolescence of a Community Platform for the OER Movement
 
Topic models, vector semantics and applications
Topic models, vector semantics and applicationsTopic models, vector semantics and applications
Topic models, vector semantics and applications
 
IMPACT Final Conference - Steven Krauwer
IMPACT Final Conference - Steven KrauwerIMPACT Final Conference - Steven Krauwer
IMPACT Final Conference - Steven Krauwer
 
PARTHENOS Training - Macro Level Issues
PARTHENOS Training - Macro Level IssuesPARTHENOS Training - Macro Level Issues
PARTHENOS Training - Macro Level Issues
 
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the CrowdMediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
MediaFinder: Collect, Enrich and Visualize Media Memes Shared by the Crowd
 
Building an Open Operations Room for the OER Community #opened16
Building an Open Operations Room for the OER Community #opened16Building an Open Operations Room for the OER Community #opened16
Building an Open Operations Room for the OER Community #opened16
 
OntoSOC: S ociocultural K nowledge O ntology
OntoSOC:  S ociocultural  K nowledge  O ntology OntoSOC:  S ociocultural  K nowledge  O ntology
OntoSOC: S ociocultural K nowledge O ntology
 
Linked Open Data and Applications
Linked Open Data and Applications Linked Open Data and Applications
Linked Open Data and Applications
 
cycLearning: Cyc and Currious Cat as learning companion
cycLearning: Cyc and Currious Cat as learning companioncycLearning: Cyc and Currious Cat as learning companion
cycLearning: Cyc and Currious Cat as learning companion
 
Sdi, communities and social media
Sdi, communities and social mediaSdi, communities and social media
Sdi, communities and social media
 
Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...
Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...
Europeana Cloud - Work Package 1: Assessing Researcher Needs in the Cloud and...
 
Dh presentation 2018
Dh presentation 2018Dh presentation 2018
Dh presentation 2018
 
Introductory Lecture Information Systems 2011.12
Introductory Lecture Information Systems 2011.12Introductory Lecture Information Systems 2011.12
Introductory Lecture Information Systems 2011.12
 
Humanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanitiesHumanist machine interaction for the digital humanities
Humanist machine interaction for the digital humanities
 

More from Kepa J. Rodriguez

LOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish StudiesLOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish StudiesKepa J. Rodriguez
 
Use case: data edited as a book !!!
Use case: data edited as a book !!!Use case: data edited as a book !!!
Use case: data edited as a book !!!Kepa J. Rodriguez
 
Building a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationBuilding a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationKepa J. Rodriguez
 
Information Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchInformation Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchKepa J. Rodriguez
 
Resources for linguistically motivated Multilingual Anaphora Resolution
Resources for linguistically motivated Multilingual Anaphora ResolutionResources for linguistically motivated Multilingual Anaphora Resolution
Resources for linguistically motivated Multilingual Anaphora ResolutionKepa J. Rodriguez
 

More from Kepa J. Rodriguez (6)

LOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish StudiesLOD4JS - Linked Open Data for Jewish Studies
LOD4JS - Linked Open Data for Jewish Studies
 
Use case: data edited as a book !!!
Use case: data edited as a book !!!Use case: data edited as a book !!!
Use case: data edited as a book !!!
 
Building a 3-gram model for Language Identification
Building a 3-gram model for Language IdentificationBuilding a 3-gram model for Language Identification
Building a 3-gram model for Language Identification
 
Information Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical ResearchInformation Extraction on Noisy Texts for Historical Research
Information Extraction on Noisy Texts for Historical Research
 
Resources for linguistically motivated Multilingual Anaphora Resolution
Resources for linguistically motivated Multilingual Anaphora ResolutionResources for linguistically motivated Multilingual Anaphora Resolution
Resources for linguistically motivated Multilingual Anaphora Resolution
 
Cross Document Coreference
Cross Document CoreferenceCross Document Coreference
Cross Document Coreference
 

Recently uploaded

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Design and prototype of a Help Desk System for EHRI: an Information Retrieval approach

  • 1. Design of a Helpdesk System for Archival Infrastructures – An Information Retrieval approach Kepa J. Rodriguez (SUB-Göttingen) Talk at the Göttingen Center for Digital Humanities (GCDH) CONNECTING COLLECTIONS 21/05/2013
  • 2. Outline
      • The EHRI project
      • Motivation
      • Analysis of user requests
      • Representation of archival institutions in the system
      • Relevance of an institution to answer a query
      • Experiments
      • A first prototype: presentation and demo
      • Further work
  • 3. The EHRI project
      • EHRI – European Holocaust Research Infrastructure – aims to build a virtual research environment for Holocaust research
      • 20 institutions from 13 countries are involved in the development
      • EHRI aims to integrate collection descriptions of more than 1,000 archives, museums, memorials and other institutions
      • More information: http://www.ehri-project.eu
  • 4. Why a helpdesk?
      • Historians often get useful information directly from archivists rather than from catalogs and finding aids.
      • In Shoah research the material is very dispersed: before archivists can be contacted, the researcher needs to identify the institution.
      • One of the goals of EHRI and other European projects is to put researchers in contact with collection-holding institutions.
      • Partial solution: EHRI will provide an integrated help desk to support users in identifying the institution.
  • 5. What is a helpdesk?
      • In the automatic helpdesk:
          • The user describes the problem in an e-mail
          • She/he sends it to the helpdesk
          • The system analyzes the problem and proposes suitable archival institutions.
  • 6. Analysis of user queries: user requests
      Data set: 400 e-mails sent to NIOD Amsterdam (Institute for War, Holocaust and Genocide Studies).
      The most frequent user queries are:
      • Information requests about individual persons or families.
      • Information requests about concentration camps, ghettos, KZ and detention facilities. Forced labor.
      • Information requests about documents.
      • Information requests about places.
      • Information requests about press, radio and underground press.
  • 7. Analysis of user queries: provided information (1)
      Data about persons
      • Name, family name, initials
      • Biographic data: profession, position in companies, role in the army, resistance, antifascist and fascist organizations.
      • Membership in political organizations, in the resistance, etc.
      • Dates of birth, travel, arrest, deportation, execution, death, exile.
      • Places of birth, residence, travel, arrest, deportation, execution, death, exile.
      • Citizenship
  • 8. Analysis of user queries: provided information (2)
      Data about facilities and transports
      • Place of KZ, detention facilities, ghettos.
      • Route of transports, stations.
      • Dates of transports.
      Historical events:
      • Described by groups of words.
      • Often defined by verbs or nominalizations: uprising, assassination, bombardment, etc.
  • 9. Proposed solution
      The help desk:
      • Knows all necessary things about institutions.
          • How are institutions represented?
      • Understands the needs of the user.
          • How is the query analyzed and represented?
      • Gives feedback to the user about relevant institutions.
          • How is relevance computed?
  • 10. How can we define an institution for our purposes?
      • An institution is defined by:
          • Its collections, files, documents.
          • Its controlled vocabularies.
          • Its profile.
  • 11. … and represent it? … a short formal introduction
      First of all, an introduction to vector spaces.
      Example: two-dimensional vector space.
      We use a larger space: 71,000 dimensions or more.
  • 12. Representation of institutions in vector space
      • Collections, file descriptions and controlled vocabularies are sets of words.
      • Features of the space are (selected) word lemmas
      • Features are extracted using natural language processing
          • Texts are tokenized and words identified
          • Each word is annotated with a part of speech (POS)
          • Each word is annotated with a lemma
          • Words whose POS tag is in a selected list are candidates for being a feature
  • 13. Representation of institutions in vector space (1)
      Example of NLP processed text (word, POS tag, lemma)
      ......
      including   VVG  include
      circulars   NNS  circular
      from        IN   from
      the         DT   the
      Central     NP   Central
      Refugee     NP   Refugee
      Committee   NP   Committee
      on          IN   on
      the         DT   the
      enlistment  NN   enlistment
      of          IN   of
      tradesmen   NNS  tradesman
      into        IN   into
      the         DT   the
      army        NN   army
      ......
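The feature selection step on the tagged text above can be sketched as follows. This is a minimal illustration, not the prototype's code: the function name `extract_features`, the tag prefixes (chosen to cover nouns, proper nouns and verbs) and the lowercasing of lemmas are assumptions made for the example.

```python
from collections import Counter

# Assumed tag prefixes: NN/NNS (nouns), NP/NPS (proper nouns), VV... (verbs).
SELECTED_PREFIXES = ("NN", "NP", "VV")


def extract_features(tagged_tokens):
    """Count the lemmas of all tokens whose POS tag starts with a selected prefix."""
    counts = Counter()
    for _word, pos, lemma in tagged_tokens:
        if pos.startswith(SELECTED_PREFIXES):
            # Lowercasing is an assumption of this sketch.
            counts[lemma.lower()] += 1
    return counts


# The (word, POS, lemma) triples from the example slide.
tagged = [
    ("including", "VVG", "include"),
    ("circulars", "NNS", "circular"),
    ("from", "IN", "from"),
    ("the", "DT", "the"),
    ("Central", "NP", "Central"),
    ("Refugee", "NP", "Refugee"),
    ("Committee", "NP", "Committee"),
    ("on", "IN", "on"),
    ("the", "DT", "the"),
    ("enlistment", "NN", "enlistment"),
    ("of", "IN", "of"),
    ("tradesmen", "NNS", "tradesman"),
    ("into", "IN", "into"),
    ("the", "DT", "the"),
    ("army", "NN", "army"),
]
features = extract_features(tagged)
```

Function words such as "the" and "from" are dropped, while "tradesmen" contributes the lemma "tradesman".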
  • 14. Representation of institutions in vector space (2)
      • Each feature is associated with a value/weight
      • Two functions used for the experiments
          • TF-IDF: Term frequency / Inverse document frequency.
          • Okapi-BM25.
      • User query: the weight is the term frequency
  • 15. Representation of institutions in vector space (3) Example of function: TF-IDF
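The formula shown on this slide is not preserved in the transcript. As a hedged illustration, one common TF-IDF variant weights a lemma by its relative frequency in a document multiplied by log(N / df), where N is the number of documents and df the number of documents containing the lemma. The sketch below implements that variant; the function name `tf_idf_vectors` and the toy data are assumptions, and the talk may have used a different normalization.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Map each document name to a {lemma: weight} dictionary.

    docs: dict of document name -> list of lemmas.
    Weight = (count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for lemmas in docs.values():
        doc_freq.update(set(lemmas))  # count each lemma once per document

    vectors = {}
    for name, lemmas in docs.items():
        term_counts = Counter(lemmas)
        total = len(lemmas)
        vectors[name] = {
            lemma: (count / total) * math.log(n_docs / doc_freq[lemma])
            for lemma, count in term_counts.items()
        }
    return vectors


# Toy data, invented for illustration only.
docs = {
    "A": ["newspaper", "press", "press"],
    "B": ["transport", "press"],
}
vectors = tf_idf_vectors(docs)
```

A lemma that occurs in every document, such as "press" here, gets weight 0 under this variant, so it does not help to distinguish institutions.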
  • 16. Representation of institutions in vector space (4)
      Example of vector (fragment)
      ......
      activist     0.07296286111157319
      answer       0.14467748398012975
      welfare      0.04822582799337658
      job          0.04822582799337658
      number       0.04822582799337658
      regard       0.5107400277810122
      emigration   0.07296286111157319
      create       0.04822582799337658
      include      0.2553700138905061
      soldier      0.04822582799337658
      editor       0.04822582799337658
      ownership    0.07296286111157319
      poster       0.04822582799337658
      branch       0.04822582799337658
      disclosure   0.04822582799337658
      underground  0.07296286111157319
      newspaper    0.43403245194038925
      press        0.04822582799337658
      ......
  • 17. Relevance: how can it be measured?
      We can measure relevance as:
      • Angle between the vectors (e.g. cosine)
      • Distance between the end-points of the vectors
      • Etc.
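Both measures can be sketched on sparse vectors stored as {lemma: weight} dictionaries. This is an illustrative sketch only: the helper names `cosine`, `euclidean` and `rank_institutions` and the toy vectors are assumptions, not part of the talk.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * v.get(lemma, 0.0) for lemma, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def euclidean(u, v):
    """Distance between the end-points of two sparse vectors."""
    lemmas = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in lemmas))


def rank_institutions(query, institutions):
    """Return institution names sorted by decreasing cosine similarity to the query."""
    return sorted(
        institutions,
        key=lambda name: cosine(query, institutions[name]),
        reverse=True,
    )


# Toy vectors, invented for illustration; query weights are term frequencies.
query = {"newspaper": 2, "press": 1}
institutions = {
    "A": {"newspaper": 0.4, "press": 0.05},
    "B": {"transport": 0.5},
}
ranking = rank_institutions(query, institutions)
```

Cosine ignores vector length, which matters here because a short e-mail is compared with vectors built from very large sets of collection descriptions.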
  • 18. Experiments: task
      • Collection descriptions were used to build a vector space model.
      • Institutions provided us with e-mails containing information requests
          • The institution was able to give the requested information.
          • The information was related to their collections, not to administrative issues.
      • Goal: find out to which institution an e-mail was sent
  • 19. Experiments: data sets
      • Learning set: collection and file descriptions of
          • Wiener Library: 553 documents / 84,918 words
          • NIOD: 1,929 documents / 2,095,402 words (English translation)
          • Yad Vashem: 17,086 documents / 1,491,858 words
          • Jewish Museum in Prague: 1 document / 10,825 words
      • Test set: e-mails sent to:
          • NIOD: 15 e-mails / 2,759 words
          • Yad Vashem: 21 e-mails / 3,900 words
          • Wiener Library: 30 e-mails / 2,342 words
  • 20. Experiments: functions and metrics
      • Features ranked using
          • TF-IDF
          • Okapi-BM25
      • Similarity computed using
          • scalar product
          • cosine similarity
          • Euclidean distance
      • Evaluation: Precision, Recall, F1
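As a reminder of how the evaluation metrics are defined, the sketch below computes precision, recall and F1 for one institution from the institution each e-mail was actually sent to (gold) and the one the system proposed (predicted). The function name and the toy labels are assumptions for illustration; they are not data from the experiments.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall and F1 for one label over paired gold/predicted lists."""
    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == label and p == label)
    false_pos = sum(1 for g, p in pairs if g != label and p == label)
    false_neg = sum(1 for g, p in pairs if g == label and p != label)

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy labels, invented for illustration only.
gold = ["NIOD", "NIOD", "Yad Vashem", "Wiener Library"]
predicted = ["NIOD", "Yad Vashem", "Yad Vashem", "Wiener Library"]
niod_scores = precision_recall_f1(gold, predicted, "NIOD")
yad_vashem_scores = precision_recall_f1(gold, predicted, "Yad Vashem")
```

In this toy case one NIOD e-mail is assigned to Yad Vashem, which lowers recall for NIOD and precision for Yad Vashem.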
  • 22. Experiments: result
      Using cosine similarity we obtain: F1 ≈ 0.7 – 0.8
      Several cases are not real errors:
      • More than one institution is able to answer the question
  • 23. Difficulties
      • Often more than one institution is able to answer a user request
          • That is good for the system, but makes the evaluation difficult
          • With more data we can do an evaluation based on relevance ranking.
      • Very unspecific requests can be answered by almost all institutions.
          • Example: “I am looking for transportations list from Bratislava to Auschwitz, happens on march 1942 witch concern my mother. May I ask you for help?”
  • 24. Difficulties
      • Multilinguality
          • Collection and file descriptions are in different languages
          • Machine translation is necessary
          • For the experiments, this was done by hand with Google
          • A language identification module will be necessary.
      • We will integrate collection descriptions, but very interesting information is at levels not accessible to EHRI
          • File level description
          • Documents
  • 25. Prototype
      • Implementation:
          • Web application written in Java
      • Maven, Spring framework
      • NLP-based feature extraction
          • Tokenizer: self-developed
          • POS tagger and lemmatizer
      • TreeTagger http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
      • Extracted lemmas of words with POS noun, proper noun, verb
      • Ranking function: TF-IDF
      • Similarity: cosine
  • 26. Next steps (1)
      • Explore use of other sources of knowledge
          • Authority files: names of persons, places...
          • Institution descriptions
      • Based on the data, implement lists and dictionaries for:
          • Synonymy:
              • Nazi, nationalsozialist and national-sozialist are different features
          • Alternative spellings and misspellings
      • Text pre- and post-processing
  • 27. Next steps (2)
      • Multilinguality:
          • Use APIs for automatic translation
          • Implement a language identification module
      • Use the ranking of relevance
          • For evaluation
          • To present results to the user
          • Find a threshold to define when results are irrelevant and the user needs to give more details
  • 28.
      • NIOD Institute for War, Holocaust and Genocide Studies (NL)
      • CEGES-SOMA Centre for Historical Research and Documentation on War and Contemporary Society (BE)
      • Jewish Museum in Prague (CZ)
      • Institute of Contemporary History Munich – Berlin (DE)
      • YAD VASHEM The Holocaust Martyrs’ and Heroes’ Remembrance Authority (IL)
      • The Wiener Library – Institute of Contemporary History (UK)
      • Holocaust Memorial Center (HU)
      • HL-senteret Center for Studies of Holocaust and Religious Minorities (NO)
      • NAF National Archives of Finland (FI)
      • The Emanuel Ringelblum Jewish Historical Institute (PL)
      • King’s College London (UK)
      • Georg-August-Universität Göttingen – Göttingen State and University Library (DE)
      • Athena RC/IMIS (GR)
      • DANS Data Archiving and Networked Services (NL)
      • Shoah Memorial, Museum, Center for Contemporary Jewish Documentation (FR)
      • ITS International Tracing Service (DE)
      • Memorial to the Murdered Jews of Europe (DE)
      • Terezín Memorial (CZ)
      • Beit Theresienstadt (IL)
      • VWI Vienna Wiesenthal Institute for Holocaust Studies (AT)
      CONNECTING KNOWLEDGE
  • 29. CONNECTING COLLECTIONS Design of a Helpdesk System for Archival Infrastructures – An Information Retrieval approach 21/05/2013