Design and prototype of a Help Desk System for EHRI: an Information Retrieval approach


Published on

User surveys of researchers and archivists realized at the EHRI project show that historians often find a significant amount of their archival sources not through finding aids, catalogs and research guides (online or not) but by talking directly to archivists.

In the case of very dispersed archival sources, an additional problem for researchers is to find the archival institution that can help them find useful material. That motivated the EHRI project to integrate an automatic helpdesk in the portal as a partial solution for this problem. This helpdesk should be able to interpret the user needs and compute the relevance of institutions to give help.

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Design and prototype of a Help Desk System for EHRI: an Information Retrieval approach

  1. 1. Design of a Helpdesk System for ArchivalInfrastructures – An Information Retrieval approachKepa J. Rodriguez (SUB-Göttingen)Talk at the Göttingen Center for Digital Humanities (GCDH)CONNECTING COLLECTIONS21/05/2013
  2. 2. 14OutlineThe EHRI projectMotivationAnalysis of user requestsRepresentation of archival institutions in the systemRelevance of a institution to answer a queryExperimentsA first prototype: presentation and demoFurther work
  3. 3. 14The EHRI projectEHRI – European Holocaust Research Infrastructure –aims to build a virtual research environment forHolocaust research20 institutions of 13 countries are involved in thedevelopmentEHRI aims to integrate collection descriptions of morethan 1.000 archives, museums, memorials and otherinstitutionsMore information:
  4. 4. 14Why a helpdesk?Often historians get useful information directly fromarchivists and not from catalogs and finding aids.In Shoah research the material is very dispersed: beforearchivists are contacted the researcher needs to identifythe institution.One of the goals of EHRI and other European projects isto put in contact researchers and collection holdinginstitutions.Partial solution: EHRI will provide an integrated helpdesk to support users in the identification of theinstitution.
  5. 5. 14What is a helpdesk?In the authomatic helpdesk:The user writes this problem as an e-mail,She/he sends it to the helpdeskThe system analyzes the problem and proposessuitable archival institutions.
  6. 6. 14Analyisis of user queries: user requestsData set: 400 e-mails send to the NIOD-Amsterdam.(Institute for War, Holocaust and Genocide Studies)Most habitual user queries are:Information requests about individual persons orfamilies.Information requests about concentration camps,ghettos, KZ and detention facilities. Forced labor.Information requests about documents.Information requests about places.Information requests about press, radios andunderground press.
  7. 7. 14Analyisis of user queries: provided information (1)Data about personsName, family name, initialsBiographic data: profession, position in companies, rolein the army, resistance, antifascist and fascistorganizations.Membership in political organizations, in resistance, etc.Dates of birth, travel, arrest, deportation, execution,death, exile.Places of birth, residence, travel, arrest, deportation,execution, death, exile.Citizenship
  8. 8. 14Analyisis of user queries: provided information (2)Data about facilities and transportsPlace of KZ, detention facilities, ghettos.Route of transports, stations.Dates of transports.Historical events:Described by groups of words.Often defined by verbs or nominalizations:uprising, assassination, bombardment, etc.
  9. 9. 14Proposed solutionThe help desk:Knows all necessary things about institutions.How are institutions represented?Understands the needs of the user.How is the query analyzed and represented?Gives feedback to the user about relevantinstitutions.How is relevance computed?
  10. 10. 14How can we define an institution for our purposes?An institution is defined by:Its collections, files, documents.Its controlled vocabularies.Its profile.
  11. 11. 14…. and represent it? … a short formal introductionFirst of all an introduction to vector spacesExample: bi-dimensional vector spaceWe use a larger space: 71.000 dimensions or more
  12. 12. 14Representation of institutions in vector spaceCollections, file descriptions, controlled vocabulariesare sets of words.Features of the space are (selected) word lemmasFeatures are extracted using natural languageprocessingText are tokenized and words identifiedEach word is annotated with a Part of SpeechEach word is annotated with a lemmaWords with a selected list of POS tags arecandidates for being a feature
  13. 13. 14Representation of institutions in vector space (1)Example of NLP processed text......including VVG includecirculars NNS circularfrom IN fromthe DT theCentral NP CentralRefugee NP RefugeeCommittee NP Committeeon IN onthe DT theenlistment NN enlistmentof IN oftradesmen NNS tradesmaninto IN intothe DT thearmy NN army......
  14. 14. 14Representation of institutions in vector space (2)Each feature is associated with a value/weightTwo functions used for the experimentsTF-IDF: Term frequency / Inverse documentfrequency.Okapi-BM25.User query: the weight is the term frequency
  15. 15. 14Representation of institutions in vector space (3)Example of function: TF-IDF
  16. 16. 14Representation of institutions in vector space (4)Example of vector (fragment)......activist 0.07296286111157319answer 0.14467748398012975welfare 0.04822582799337658job 0.04822582799337658number 0.04822582799337658regard 0.5107400277810122emigration 0.07296286111157319create 0.04822582799337658include 0.2553700138905061soldier 0.04822582799337658editor 0.04822582799337658ownership 0.07296286111157319poster 0.04822582799337658branch 0.04822582799337658disclosure 0.04822582799337658underground0.07296286111157319newspaper 0.43403245194038925press 0.04822582799337658......
  17. 17. 14Relevance: how can it be measured?We can measure relevance as:Angle between the vectors (e.g. Cosine)Distance between the end-points of the vectorEtc.
  18. 18. 14Experiments: taskCollection descriptions were used to build avector space model.Institutions provide us e-mails with informationrequestsThe institution was able to give the requestedinformation.The information was related with theircollections, not with administrative issues.Goal: Find out, to which institution an e-mail hasbeen sent
  19. 19. 14Experiments: data setsLearning set: collection and file descriptions ofWiener Library: 553 documents / 84,918wordsNIOD: 1929 documents / 2,095,402 words(English translation)Yad Vashem: 17,086 documents / 1,491,858wordsJewish Museum Prag: 1 document / 10,825wordsTest set: e-mails sent to:NIOD: 15 e-mails / 2759 wordsYad Vashem: 21 e-mails / 3900 wordsWiener Library: 30 e-mails / 2342 words
  20. 20. 14Experiments: Functions and metricsFeatures ranked usingTF-IDFOkapi-BM25Similarity computed usingscalar productcosine similarityEuclidean distanceEvaluation: Precision, Recall, F1
  21. 21. 14Experiments: evaluation
  22. 22. 14Experiments: resultUsing cosine similarity we obtain:F1 =~ 0.7 – 0.8Several cases are not real errors:More than one institution is able to answer thequestion
  23. 23. 14DifficultiesOften more than one institution is able to answer a userrequestThat is good for the system, but makes the evaluationdifficultWith more data we can do a relevance ranking basedevaluation.Cases of very unspecific requests can be answered byalmost all institutions.Example: “I am looking for transportations list fromBratislava to Auschwitz, happens on march 1942witch concern my mother. May I ask you for help?“
  24. 24. 14DifficultiesMultilingualityCollection and file descriptions are in differentlanguagesMachine translation is necessaryFor the experiments, this was done by hand withGoogleA language identification modul will be necessary.We will integrate collection descriptions, but veryinteresting information is in levels not accesssible for EHRIFile level descriptionDocuments
  25. 25. 14PrototypeImplementation:Web application writen in JavaMaven, Spring frameworkNLP based feature extractionTokenizer: self developedPOS-tagger and lemmatizerTreeTaggerExtracted lemmas of words with POS Noun, Proper noun, VerbRanking function: TF-IDFSimilarity: Cosine
  26. 26. 14Next steps (1)Explore use of other sources of knowledgeAuthority files: names of persons, places...Institution descriptionsBased on the data implement lists and dictionariesfor:Synonimy:Nazi, nationalsozialist and national-sozialistare different featuresAlternative spellings and misspellingsText pre- and post- processing
  27. 27. 14Next steps (2)Multilinguality:Use APIs for automatic translationImplement a language identification modulUse the ranking of relevanceFor evaluationTo present results to the userFind a threshold to define when results areirrelevant and user needs to give more details
  28. 28. NIOD Institute for War, Holocaust andGenocide Studies (NL) CEGES-SOMA Centre for Historical Researchand Documentation on War and ContemporarySociety (BE) Jewish Museum in Prague (CZ) Institute of Contemporary History Munich – Berlin(DE) YAD VASHEM The Holocaust Martyrs’ andHeroes’ Remembrance Authority (IL) The Wiener Library – Institute of ContemporaryHistory (UK) Holocaust Memorial Center (HU) HL-senteret Center for Studies of Holocaustand Religious Minorities (NO) NAF National Archives of Finland (FI) The Emanuel Ringelblum Jewish Historical Institute (PL)King’s College London (UK) Georg-August-Universität Göttingen – GöttingenState and University Library (DE) Athena RC/IMIS (GR) DANS Data Archiving and Networked Services (NL) Shoah Memorial, Museum, Center for ContemporaryJewish Documentation (FR) ITS International Tracing Service (DE) Memorial to the Murdered Jews of Europe (DE) Terezín Memorial (CZ) Beit Theresienstadt (IL) VWI Vienna Wiesenthal Institute forHolocaust Studies (AT)CONNECTING KNOWLEDGE
  29. 29. CONNECTING COLLECTIONSDesign of a Helpdesk System for ArchivalInfrastructures – An Information Retrieval approach21/05/2013