
Managing the Digitization of Large Press Archives

From the 2014 DLF Forum in Atlanta, GA.

Session Leaders
Bassem Elsayed, Bibliotheca Alexandrina
Ahmed Samir, Bibliotheca Alexandrina

Managing the digitization of press material is a challenge, not only in terms of quantity but also in terms of text and material quality, the design of the workflow system that organizes the operations, and the handling of metadata. This challenge has been the focus of the Bibliotheca Alexandrina’s digitization work during the past year, in the course of its partnership with the Center for Economic, Judicial, and Social Study and Documentation (CEDEJ). With more than 800,000 pages of press articles to be digitally preserved and made publicly accessible, there was a clear need for a workflow that can manage such a massive collection and handle its attributes proficiently. The project required work on four fronts at once: analyzing the collection's data, developing a digitization workflow for the collection, implementing and installing the software tools needed for metadata entry, and publishing the digital archive online for researchers and the public.

The presentation will demonstrate the workflow system being implemented to manage this massive press collection, which has yielded more than 400,000 pages to date. It will shed some light on the BA’s Digital Assets Factory (DAF), the nucleus upon which the digitization process for the CEDEJ collection has been built. The presentation will also discuss the tools implemented for ingesting data into the digitization process, from indexing through the creation of the batches that are ingested into the system. The outflow will be discussed in terms of organizing and grouping multipart press clips, as well as reviewing, validating, and correcting the output. Finally, it will cover the challenges encountered in pairing the online archive with a powerful search engine that supports multidimensional search while maintaining a user-friendly navigation experience.

Managing the Digitization of Large Press Archives

  1. The New Library of Alexandria Overview Bibliotheca Alexandrina (BA)
  2. • Center of excellence in the production and dissemination of knowledge
     • Place of dialogue, learning and understanding between cultures and peoples
  3. • The World’s Window on Egypt
     • Egypt’s Window on the World
     • Instrument for Rising to the Challenges of the Digital Age
     • Center for Dialogue Between Peoples and Civilizations
  4. Not just a Library of Books but rather a vast cultural and scientific complex
  5. A library that can accommodate millions of books
  6. http://archive.bibalex.org
  9. http://descegy.bibalex.org
  10. http://lartarab.bibalex.org
  11. More than 230,000 Arabic books are freely available online for Arabic readers worldwide
  12. http://suezcanal.bibalex.org
  14. http://naguib.bibalex.org/
  15. http://nasser.bibalex.org
  16. http://sadat.bibalex.org
  17. • Project Overview
      • Collection Overview
      • Data Representation
      • System Workflow
          - DAF (Digital Assets Factory)
          - Cataloguing
          - Website
              · Solr search engine
              · Article viewer
  19. • The Centre for Economic, Judicial, and Social Study and Documentation (CEDEJ) collaborated with Bibliotheca Alexandrina (BA) to digitize its massive archive of press articles
      • The project consists of multiple modules to:
          - Index the press archive collection
          - Control the data entry workflow
          - Digitize and process data
          - Catalogue and review articles
          - Publish the archive on the web
  21. • Press archive package
          - 800,000+ press clips, varying between:
              · Press
              · Reports
          - 500+ publishers
          - 60,000+ writers and reporters
          - 200 different subjects
              · Economics, politics, social life, etc.
          - Archive languages:
              · Arabic, English and French
          - Date range from 1966 to 2009
  22. • Finished so far
          - 115,000 press clips, varying between:
              · Press
              · Reports
          - 200 publishers
          - 14,000 writers and reporters
          - 100 different subjects
              · Economics, politics, social life, etc.
          - Archive languages:
              · Arabic, English and French
          - Date range from 1966 to 2009
  24. • A list of the packaged press archive is submitted to Bibliotheca Alexandrina to be scanned and catalogued
      • The source of the data is a collection of boxes
      • Each box is organized in the following hierarchy:
          - Folder
          - File
          - Sub-File
          - Document
      • A document represents a single page of press material
  32. Article Creation
  33. Article Metadata
  34. Lookups Management
  35. Reports
  39. • Based on Apache Lucene project v4.1
      • SolrNet API is used to connect to the Solr server
      • Features
          - Simple/advanced search
          - Results highlighting
          - Field autocomplete
          - Text search (Article Viewer)
  47. • The article viewer is used for previewing articles
          - It is one of multiple viewers developed at BA
      • Architecture
          - Server side: RESTful services
          - Client side: JavaScript using JSONP
      • Features
          - Image preview
          - Metadata preview
          - Text selection
          - Searching/highlighting
          - Zooming options: fit width/height
  48. • Viewer web services
          - Metadata web service:
              · Retrieves the article's catalogue metadata
              · Returns technical information (width, height, page count, etc.)
          - Content web service:
              · Retrieves the image of each page in the article, scaled responsively to a custom width and height
              · Returns the selected text based on the user-highlighted area
          - Search web service:
              · Searches the content of the articles using the Solr engine APIs
              · Highlights the matching phrases in the article image
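
The box hierarchy described on slide 24 (Box, Folder, File, Sub-File, Document) could be modelled roughly as follows. This is a minimal sketch and not the project's actual data model: the class name `DocumentRef`, its field names, the path format, and the idea of grouping pages by sub-file are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentRef:
    """Location of one scanned page in the physical archive (hypothetical model)."""
    box: str
    folder: str
    file: str
    sub_file: str
    document: int  # one document = one page of press material

    def path(self) -> str:
        # Join the levels in the order listed on the slide; the format is an assumption.
        return "/".join(
            [self.box, self.folder, self.file, self.sub_file, f"{self.document:04d}"]
        )


def group_by_sub_file(refs):
    """Group page references by their containing sub-file, in page order."""
    groups = {}
    ordered = sorted(
        refs, key=lambda r: (r.box, r.folder, r.file, r.sub_file, r.document)
    )
    for ref in ordered:
        key = (ref.box, ref.folder, ref.file, ref.sub_file)
        groups.setdefault(key, []).append(ref)
    return groups
```

For example, `DocumentRef("B1", "F1", "A", "S1", 2).path()` returns `"B1/F1/A/S1/0002"`. Grouping by sub-file is only one plausible way to collect the pages that belong together; the slides do not say how multipart clips are actually assembled.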
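
The search features listed on slide 39 (search with results highlighting) map onto standard Solr query parameters. The project itself connects through SolrNet; the sketch below uses Python only to show what such a request might look like. The base URL and the field names `content` and `language` are hypothetical, while `q`, `fq`, `rows`, `wt`, `hl` and `hl.fl` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_search_url(base_url, text, language=None, rows=10):
    """Build a Solr /select URL for a phrase search with hit highlighting.

    `base_url` and the field names `content` and `language` are assumptions.
    """
    # Escape backslashes and double quotes so the phrase query stays well formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", f'content:"{escaped}"'),
        ("rows", str(rows)),
        ("wt", "json"),
        ("hl", "true"),        # ask Solr to return highlighted snippets
        ("hl.fl", "content"),  # field in which to highlight matches
    ]
    if language is not None:
        params.append(("fq", f"language:{language}"))  # filter query
    return f"{base_url}/select?{urlencode(params)}"
```

The function only builds the URL and sends nothing over the network, so it can be tried without a running Solr server.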
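
Slides 47 and 48 mention "fit width/height" zooming and a content service that scales page images to a custom width and height. One common way to compute the target size while preserving the aspect ratio is sketched below. The function name `fit_size` and its parameters are hypothetical, and the slides do not describe how the BA viewer actually performs the scaling.

```python
def fit_size(page_width, page_height, view_width, view_height, mode="width"):
    """Return (width, height) of a page image scaled to the viewport.

    mode="width" fills the viewport width; mode="height" fills its height.
    The aspect ratio is preserved in both cases.
    """
    if page_width <= 0 or page_height <= 0:
        raise ValueError("page dimensions must be positive")
    if mode == "width":
        scale = view_width / page_width
    elif mode == "height":
        scale = view_height / page_height
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return round(page_width * scale), round(page_height * scale)
```

With a 2000 x 3000 pixel page and a 1000 x 900 viewport, fit-width gives `(1000, 1500)` and fit-height gives `(600, 900)`.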
