Automatic Preservation Watch

At the iPRES 2013 conference in Lisbon, Portugal, in September 2013, Luís Faria (KEEP SOLUTIONS LDA) presented SCAPE work on monitoring digital repositories and Scout, the tool developed for that purpose. Scout is a web-based service that helps content holders monitor their digital repository and provides an ontological knowledge base for compiling the information needed to detect preservation risks and opportunities.

Published in: Technology, News & Politics

  1. 1. Automatic Preservation Watch: Using Information Extraction on the Web. Luis Faria (lfaria@keep.pt), KEEP SOLUTIONS, www.keep-solutions.com. Alan Akbik, Barbara Sierman, Marcel Ras, Miguel Ferreira, José Carlos Ramalho. iPRES 2013, Lisbon, September 2, 2013.
  3. 3. Why do we need monitoring? A repository is exposed to risks and opportunities arising from: format obsolescence, emerging technology, consumer trends, new standards, organisation mission, bit rot, resource capability, system availability, security breach, economic limitations, social and political factors, producer trends, and organisation policies.
  4. 4. Survey on risk assessment (pie chart; segments of 60% and 40%; legend: "Yes, but manual and ad hoc" and "None").
  5. 5. Scout: a preservation watch system. Monitors aspects of the world to detect preservation risks and opportunities. This work was partially supported by the SCAPE Project. The SCAPE project is co-funded by the European Union under FP7 ICT-2009.4.1 (Grant Agreement number 270137).
  6. 6. Information sources
       • Format registries & software catalogues
       • Digital repositories & web archives
       • Organizational objectives
       • Experiments
       • Simulation
       • Human knowledge
  7. 7. Currently supported information sources
       • PRONOM
       • Repository content and events
       • Web archive content
       • Web archive renderability experiments
       • SCAPE Policy model
  8. 8. Define triggers, for example: "Notify me when there are tools that can render format X."
  9. 9. Define triggers: simple query with templates.
  10. 10. Receive notifications by email or HTTP push API, for example: "There are tools that can render format X."
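
The trigger and notification idea on the slides above can be sketched roughly as follows. This is a minimal illustration over a toy in-memory knowledge base; the `Trigger` class, the `check` function, and the sample data are hypothetical and are not Scout's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical knowledge base: format name -> tools known to render it.
KnowledgeBase = Dict[str, List[str]]

@dataclass
class Trigger:
    """A question asked of the knowledge base plus a way to notify the user."""
    description: str
    condition: Callable[[KnowledgeBase], bool]
    notify: Callable[[str], None]

def tools_can_render(fmt: str) -> Callable[[KnowledgeBase], bool]:
    """Build a condition that holds when at least one tool renders `fmt`."""
    return lambda kb: len(kb.get(fmt, [])) > 0

def check(triggers: List[Trigger], kb: KnowledgeBase) -> int:
    """Evaluate every trigger and notify for those whose condition holds."""
    fired = 0
    for trigger in triggers:
        if trigger.condition(kb):
            trigger.notify(trigger.description)
            fired += 1
    return fired

inbox: List[str] = []  # stands in for email or an HTTP push endpoint
triggers = [
    Trigger("There are tools that can render format X.", tools_can_render("X"), inbox.append),
    Trigger("There are tools that can render format Y.", tools_can_render("Y"), inbox.append),
]
kb = {"X": ["viewer-a"], "Y": []}  # invented sample facts
fired_count = check(triggers, kb)
```

In a real deployment the condition would presumably be a query against the knowledge base and the check would be re-run whenever new data arrives; both are left out of this sketch.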
  11. 11. Automatic watch limitations: it requires machine-readable data
       • Explicitly and formally specified information
       • Controlled vocabulary
       • Ontology
       • All instances use the same structure and set of values
  12. 12. Case study: e-Depot coverage. (Chart: titles per publisher and % of journal titles, by publisher. Annotation: 97% of publishers, 1-10 titles.)
  13. 13. e-journal coverage questions
       • Which publisher provides which journal titles
       • Publisher changes:
         • Ceases to provide journal
         • Transfers journal to other publisher(s)
         • Publishers merge
       • Journal changes:
         • Name changes
         • ISSN changes
         • Ceased to exist
  15. 15. Where is this information?
       • "In 1991, two years before the merger with Reed, Elsevier acquired Pergamon Press in the UK."
       • "The Asia-Europe Foundation (ASEF) sold the Asia Europe Journal and transferred the copyright to its long-time partner Springer."
       • "Acta Chirurgica Iugoslavica is available free of charge as an Open Access journal on the Internet."
     On the publisher website! Not machine readable!
  16. 16. Information extraction
       • Extract structured information from unstructured data
       • Pattern-based information extraction, e.g. "[X] acquired [Y]"
       • Some training and supervision may be needed
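
As an illustration of the "[X] acquired [Y]" pattern, here is a minimal sketch using a regular expression. It assumes, crudely, that entity mentions are runs of capitalised words; the helper name `extract_acquisitions` is hypothetical and this is not the extraction system used in the experiment.

```python
import re
from typing import List, Tuple

# Assumption: an entity is one or more capitalised words separated by spaces.
ENTITY = r"[A-Z][\w&-]*(?: [A-Z][\w&-]*)*"
ACQUIRED = re.compile(rf"({ENTITY}) acquired ({ENTITY})")

def extract_acquisitions(sentence: str) -> List[Tuple[str, str]]:
    """Return (acquirer, acquired) pairs matching the pattern '[X] acquired [Y]'."""
    return ACQUIRED.findall(sentence)

# Sentence quoted on the "Where is this information?" slide.
sentence = (
    "In 1991, two years before the merger with Reed, "
    "Elsevier acquired Pergamon Press in the UK."
)
pairs = extract_acquisitions(sentence)
```

A surface pattern like this breaks easily on lower-case words inside names or on passive constructions, which is one reason some supervision may still be needed.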
  17. 17. Experiment
       1. Data acquisition and pre-processing
       2. Relation discovery
       3. Information extraction
       4. Validation of results
  18. 18. 1. Data acquisition and pre-processing
       • Focused crawler with seed words (12,000 entries): publisher names and journal titles ➡ 500,000 Web pages
       • Pre-process with NLP tools ➡ 18 million sentences ➡ 8 GB
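
The pre-processing step above could look roughly like the sketch below, which splits page text into sentences and keeps those mentioning a seed term. The slides do not say which NLP tools were used; the naive splitter, the function names, and the sample page text are assumptions made for illustration.

```python
import re
from typing import Iterable, List

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def mentions_seed(sentence: str, seeds: Iterable[str]) -> bool:
    """True when the sentence contains any seed term, ignoring case."""
    lowered = sentence.lower()
    return any(seed.lower() in lowered for seed in seeds)

# Invented page text, used only to exercise the functions.
page_text = (
    "Welcome to our site. The Asia Europe Journal is now published by Springer. "
    "Contact us for details."
)
seeds = ["Springer", "Asia Europe Journal"]
kept = [s for s in split_sentences(page_text) if mentions_seed(s, seeds)]
```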
  19. 19. 2. Relation discovery. Prominent patterns by rank:
       Rank  Prominent pattern
       1     [X] journal of [Y]
       2     [X] published by [Y]
       3     [X] journal on [Y]
       4     [X] journal published by [Y]
       5     [X] available as [Y] journal
       9     PubMed [X] [Y]
       25    [X] science proceedings of [Y]
       30    [X] subscription available to [Y]
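
One way to picture relation discovery is to collect the text that appears between two known entities and rank the resulting patterns. The sketch below ranks by raw frequency, which is an assumption; the slides do not state how the ranking was computed, and the function name and sample sentences are invented.

```python
from collections import Counter
from typing import List, Optional, Tuple

def pattern_for(sentence: str, x: str, y: str) -> Optional[str]:
    """Return '[X] <text between> [Y]' when x occurs before y, else None."""
    i = sentence.find(x)
    if i < 0:
        return None
    j = sentence.find(y, i + len(x))
    if j < 0:
        return None
    middle = sentence[i + len(x):j].strip()
    return f"[X] {middle} [Y]"

# Invented (sentence, X, Y) samples; real input would come from the crawl.
samples: List[Tuple[str, str, str]] = [
    ("Acta X published by Press P", "Acta X", "Press P"),
    ("Annals Y published by Press Q", "Annals Y", "Press Q"),
    ("Review Z journal of Society S", "Review Z", "Society S"),
]
counts = Counter(
    p for p in (pattern_for(s, x, y) for s, x, y in samples) if p is not None
)
ranked = counts.most_common()  # most frequent pattern first
```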
  21. 21. 3. Information extraction: 2,000 journal titles and 500 journal-publisher attributions.
  22. 22. 4. Validation of results (two pie charts; legend: Should add, Existing, False-positives)
       • Journal titles in eDepot: 4%, 10%, 86%
       • Title-publisher attributions in the Keepers Registry: 15%, 50%, 35%
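
The validation step can be thought of as comparing extracted pairs against a registry. The following is a minimal sketch under that assumption: pairs already in the registry count as "existing", and the rest are set aside for manual review, where a person would decide between "should add" and "false positive". The names `normalise` and `validate` and the sample records are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (journal title, publisher)

def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())

def validate(extracted: Iterable[Pair], registry: Iterable[Pair]) -> Tuple[List[Pair], List[Pair]]:
    """Split extracted pairs into those already in the registry and those to review."""
    known: Set[Pair] = {(normalise(t), normalise(p)) for t, p in registry}
    existing: List[Pair] = []
    to_review: List[Pair] = []
    for title, publisher in extracted:
        key = (normalise(title), normalise(publisher))
        (existing if key in known else to_review).append((title, publisher))
    return existing, to_review

# Illustrative records only, not data from the experiment.
registry = [("Asia Europe Journal", "Springer")]
extracted = [("Asia  Europe Journal", "springer"), ("Example Journal", "Example Press")]
existing, to_review = validate(extracted, registry)
```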
  23. 23. False-positives
       • Detecting boundaries of titles and publisher names, e.g. "European Journal of Nuclear Medicine and Molecular Imaging"
       • Use of abbreviations in titles and publisher names, e.g. IAAE for "International Association of Agricultural Economists"
       • Technical problems like encoding, e.g. "├ó╦å┼buda University"
  24. 24. Conclusions
       • We need data to support digital preservation
         • Explicitly and formally specified, for automation
       • Registries tend to be incomplete and outdated
       • Information extraction technologies can help
       • Still, some supervision may be needed
  25. 25. Send us your use cases!
       • Preservation Watch: what risks to monitor?
       • Information Extraction: what to extract from the web?
       • Contacts: Alan Akbik (alan.akbik@tu-berlin.de), Luis Faria (lfaria@keep.pt)
  26. 26. Thank you, questions?
       • Scout, a preservation watch system
         • Site: http://openplanets.github.io/scout/
         • Demo: http://scout.scape.keep.pt
       • SCAPE Planning and Watch suite iPRES poster: http://bit.ly/scape-pw
       • SCAPE: http://www.scape-project.eu
