Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO)

83 views

Published on

Lena Hessel and Matthias Arnold presented the ECPO project at the E-Science-Days Heidelberg [https://e-science-tage.de/en/]. The presentation focused on the agents service and the approaches towards Document Layout Analysis and encoding fulltext in TEI XML.
From the abstract:
Our new cross-database agent service allows us to manage the approximately 47.000 names recorded in WoMag and ECPO: a) merge identical names across databases, b) identify agents and assigning names to them, and c) link agent records to authority data (GND, VIAF, Wikidata). Besides creating a curated list of agents occurring in the publications, we also aim to add missing persons to authority files like the GND.
One crucial aspect ECPO is full text capability. Unfortunately, OCR software cannot be used out-of-the-box, for a number of reasons: document analysis fails to recognize complex newspaper layout, character recognition fails when it faces emphasis marks next to characters, and recognized passages have to be grouped in the right semantic order.

Published in: Education
  • Be the first to comment

  • Be the first to like this

Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO)

  1. 1. Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO) Matthias Arnold, Lena Hessel | Heidelberg | E-Science-Tage 2019 | 2019-03-29
  2. 2. Research data – Chinese periodical press • First decades of the 20th century • Understudied, but dominated the contemporary print market and provide access to the "actual culture“ (R. Williams, 1961) • Challenges: • Physically dispersed, often poorly preserved • Voluminous (full runs, daily, up to >30 years) • Multi-generic and intellectually demanding • Approach • Multi-disciplinary team, >10 researchers • Women and the Periodical Press in China’s Global Twentieth Century: A Space of Their Own? Ed. by Joan Judge, Barbara Mittler and Michel Hockx, Cambridge University Press, 2018. • Database Early Chinese Periodicals Online (ECPO)
  3. 3. https://uni-heidelberg.de/ecpo
  4. 4. 276 publications: 134 with items
  5. 5. >279.000 scans
  6. 6. 40.936 issues: 46.931 articles, 20.532 images, 18.639 ads
  7. 7. Chart: Publication activity by year Arnold and Hessel | ECPO Database
  8. 8. Opening the data silo From static export to dynamic data service • Output data using the Metadata Object Description Schema (MODS) - Open Access: http://ecpo.uni-hd.de/api/mods/ From static pre-rendered files to dynamic image service • Implementation of International Image Interoperability Framework (IIIF) Image API http://iiif.io/technical-details/ From separate names to cross-db agents service • Identify agent, assign names, link to authorities, structure information, feed data back to authority files (GND)
  9. 9. Agents Service Arnold and Hessel | ECPO Database
  10. 10. 47.245 agents, 163.408 occurrences, 15 ‘languages’
  11. 11. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  12. 12. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  13. 13. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  14. 14. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  15. 15. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  16. 16. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  17. 17. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  18. 18. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  19. 19. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF
  20. 20. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND
  21. 21. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF
  22. 22. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF Baidu baike
  23. 23. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF Baidu baike
  24. 24. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 VIAF GND Wikidata VIAF Baidu baike Agents with references to authorities: VIAF: 861 Wikidata: 821 GND: 662 Baidu: 6 DBpedia: 5
  25. 25. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  26. 26. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  27. 27. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  28. 28. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  29. 29. Opening the Agents Service Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  30. 30. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  31. 31. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  32. 32. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  33. 33. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
  34. 34. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  35. 35. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019 Islington Corinthians F.C.: - Leonard Bradbury - Jack Braithwaite - Alec Buchanan - Pat Clark - George Dance - Cyril Longman - Harry Lowe - Richard Manning - Albert (Eddie) Martin - John Miller - William Miller - George Pearce - Bert Read - Johnny Sherwood - Dick Tarrant - Bill Whittaker - Ted Wingfield - J.K. Wright Source: National Library Board Singapore NewspaperSG, accessed March 25, 2019, http://eresources.nlb.gov.sg/new spapers/Digitised/Article/straitsti mes19371128-1.2.117.
  36. 36. Opening the Agents Service Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  37. 37. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  38. 38. Opening the Agents Service Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  39. 39. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  40. 40. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  41. 41. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  42. 42. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  43. 43. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  44. 44. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  45. 45. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  46. 46. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  47. 47. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  48. 48. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  49. 49. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
  50. 50. Towards full text Arnold and Hessel | ECPO Database
  51. 51. https://uni-heidelberg.de/ecpo Arnold and Hessel | ECPO Database
  52. 52. Expanding data: towards fulltext • Manual typing not feasible • Professional double-keying very expensive • OCR often unusable • Document: dense layout, normal segmentation fails • Image: noisy, secondary copies with stains/scratches • Characters: special characters (emphasis), handwriting
  53. 53. ca. 63% correctly recognized
  54. 54. Segmentation - I • Page segmentation (pattern recognition/computer vision) • Analyze layout of page, use page-internal structures • Identify semantic units • Generate co-ordinates, relate them to items, store in DB
  55. 55. Segmentation - II • Page segmentation (crowdsourcing) • Pilot project with Pallas Ludens GmbH • Let the crowd help analyzing the pages • Identify and label four item types: − image/drawing − article − advertisement − additional information • Supervised • Non-Chinese speaking community!
  56. 56. Processing 2. Page segmentation (computer vision/ocr)
  57. 57. Grouping semantic units 2. Page segmentation (crowdsourcing) • drawing – correcting – grouping
  58. 58. Outcome of segmentation pilot 1. Page segmentation can be outsourced to expert crowd • Requires supervision • Advanced user interfaces (high usability, efficiency) • Crowd should read Chinese (semantic grouping) 2. Jingbao 晶報 1919-21 completely segmented with qualified boxes, issues of April 1919 with semantic units 3. Further processing: • Partnership with Computational Knowledge Lab (知識計 算實驗室), Department of Engineering Science and Ocean Engineering, Taiwan National University, http://www.cklab.org/ • Seeking additional partners for collaboration!
  59. 59. Chinese Republican Periodicals – Encoding full text in TEI Arnold and Hessel | ECPO Database
  60. 60. Materiality issues
  61. 61. Mark-up: Different character sizes <tagsDecl> <rendition scheme="css" selector="body p">font-size: 100%;</rendition> <rendition xml:id="half">font-size: 50%</rendition> <rendition xml:id="double">font-size: 200%</rendition> </tagsDecl> <hi rendition="#double">女子之於男子</hi> <hi rendition="#half">試觀西歐各國<lb/>名為男…
  62. 62. Mark-up: Emphasis In Japanese: “emphasis dots” 圏点 (kenten) or 傍点 1. ◦ U+25E6 “open dot” 2. • U+2022 “filled dot” 3. ○ U+25CB “open circle” 4. ● U+25CF “filled circle” 5. ◎ U+25CE “open double-circle” 6. ◉ U+25C9 “filled double-circle” 7. △ U+25B3 “open triangle” 8. ▲ U+25B2 “filled triangle” 9. ﹆ U+FE46 “open sesame” 10. ﹅ U+FE45 “filled sesame” https://drafts.csswg.org/css-text-decor-3/#text-emphasis-style-property BUT: emphasis characters mixed with punctuation, differentiation and exact recording is HUGE workload -> emphasis characters currently ignored
  63. 63. Mark-up: Spaces between some characters <space unit="chars" n="1"/> OR <gap unit="char" extent="1"> </gap> (with “ ” being U+3000) OR just use U+3000 without markup
  64. 64. TEI Example
  65. 65. Wrap-up Arnold and Hessel | ECPO Database
  66. 66. From data silo towards open data • Data collection = research data • Enhance metadata • Publishing information, content analysis (keywords) • Separation of meta-/data from user interface • FAIR Prinzipien • DOI records for publications (in progress), connect database to library catalogs • Publish material and metadata Open Access, images, publication metadata, and item metadata (article, image, ad) • Basic data API (MODS XML) open up IIIF manifests and Agents data (planned) • Publish metadata on heiDATA/Dataverse (Summer) Arnold and Hessel | ECPO Database
  67. 67. Wrap-up • Provide different ways to access data via frontend: • Search (all metadata and annotations) • Browse chronological (calendar) • Browse/search agents / keywords • Categories of publications • Agents service (biographic data) • cross-db record curation, connect persons with authorities • plan (2019): add missing agents or names to GND, pull additional data from authorities, develop agents API • Page segmentation – crowdsourcing possible, grouping requires Chinese, new tool creates web-annotations – seeking partner for automatic page analysis • Text – plan: process segments, generate full text, store TEI XML, crowd-based editing
  68. 68. ECPO in a larger context • Content expansion • Early western publications printed in China • Co-operation with Univ. Erlangen: Agents • ECPO as data platform • for storing, enhancing, accessing, sharing „grey“ material from the CATS Library • Outreach/ Communities • DH-d working group Newspaper/Journals, OCR-d, Transkribus/READ • Connect with FID Asien (CrossAsia), Non-Latn scripts interest group, TEI East Asia SIG • Long-term repository: University Library, HeiDATA/HeidICON Arnold and Hessel | ECPO Database
  69. 69. Contact Matthias Arnold – Lena Hessel Heidelberg Centre for Transcultural Studies | HCTS Karl Jaspers Centre Voßstr. 2 | Building 4400 | Room 005b 69115 Heidelberg, Germany Phone: +49 - 6221 - 54 4094 eMail: matthias.arnold@uni-hd.de Web: http://tinyurl.com/matthias-arnold

×