Lena Hessel and Matthias Arnold presented the ECPO project at the E-Science-Days Heidelberg [https://e-science-tage.de/en/]. The presentation focused on the agents service and the approaches towards Document Layout Analysis and encoding fulltext in TEI XML.
From the abstract:
Our new cross-database agent service allows us to manage the approximately 47,000 names recorded in WoMag and ECPO: a) merge identical names across databases, b) identify agents and assign names to them, and c) link agent records to authority data (GND, VIAF, Wikidata). Besides creating a curated list of agents occurring in the publications, we also aim to add missing persons to authority files like the GND.
One crucial aspect of ECPO is full-text capability. Unfortunately, OCR software cannot be used out of the box, for a number of reasons: document analysis fails to recognize the complex newspaper layout, character recognition fails when it encounters emphasis marks next to characters, and recognized passages have to be grouped in the right semantic order.
Transforming data silos into knowledge: Early Chinese Periodicals Online (ECPO)
1. Transforming data silos into knowledge:
Early Chinese Periodicals Online (ECPO)
Matthias Arnold, Lena Hessel | Heidelberg | E-Science-Tage 2019 | 2019-03-29
2. Research data – Chinese periodical press
• First decades of the 20th century
• Understudied, but dominated the contemporary print market and
provides access to the "actual culture" (R. Williams, 1961)
• Challenges:
• Physically dispersed, often poorly preserved
• Voluminous (full runs, daily, up to >30 years)
• Multi-generic and intellectually demanding
• Approach
• Multi-disciplinary team, >10 researchers
• Women and the Periodical Press in China’s Global
Twentieth Century: A Space of Their Own? Ed. by Joan
Judge, Barbara Mittler and Michel Hockx, Cambridge
University Press, 2018.
• Database
Early Chinese Periodicals Online (ECPO)
11. Opening the data silo
From static export to dynamic data service
• Output data using the Metadata Object Description Schema
(MODS) - Open Access: http://ecpo.uni-hd.de/api/mods/
From static pre-rendered files to dynamic image service
• Implementation of International Image Interoperability
Framework (IIIF) Image API http://iiif.io/technical-details/
From separate names to cross-db agents service
• Identify agent, assign names, link to authorities, structure
information, feed data back to authority files (GND)
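The move from pre-rendered files to a dynamic image service can be illustrated with the URL pattern defined by the IIIF Image API (identifier/region/size/rotation/quality.format). The following is a minimal sketch, not the ECPO implementation: the base URL and the image identifier are hypothetical placeholders, and the `max` size keyword assumes a server that supports Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL.

    Pattern: {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    The identifier is percent-encoded so that slashes inside it do not
    break the path structure.
    """
    return (f"{base.rstrip('/')}/{quote(identifier, safe='')}"
            f"/{region}/{size}/{rotation}/{quality}.{fmt}")


# Hypothetical server and identifier, for illustration only.
BASE = "https://example.org/iiif"
PAGE = "jingbao_1919_04_03_p1"

# Whole page at the largest size the server offers.
full_page = iiif_image_url(BASE, PAGE)

# A cropped region (x,y,w,h in pixels), scaled to 500 pixels wide.
detail = iiif_image_url(BASE, PAGE, region="0,0,1000,1500", size="500,")

print(full_page)
print(detail)
```

Because region and size are part of the request, a client can ask for exactly the crop it needs instead of downloading a fixed rendering.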
14. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
22. Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
Authority files linked from agent records (slides 22-27): VIAF, GND, Wikidata, Baidu Baike
Agents with references to authorities:
VIAF: 861
Wikidata: 821
GND: 662
Baidu: 6
DBpedia: 5
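Counts like the ones above could be derived from agent records that each carry a set of names and a set of authority links. The sketch below is only an illustration of that idea: the `Agent` class, its field names, and the sample records with placeholder identifiers are assumptions, not the actual ECPO data model or data.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One agent (person or organisation) with its names and authority links."""
    agent_id: str
    names: set = field(default_factory=set)          # all name variants
    authorities: dict = field(default_factory=dict)  # authority -> identifier


def count_authority_links(agents):
    """Count how many agents reference each authority file."""
    counts = Counter()
    for agent in agents:
        counts.update(agent.authorities.keys())
    return counts


# Hypothetical records; the identifiers are placeholders, not real IDs.
agents = [
    Agent("a1", {"Name A", "Name A (variant)"},
          {"VIAF": "viaf-placeholder-1", "GND": "gnd-placeholder-1"}),
    Agent("a2", {"Name B"}, {"VIAF": "viaf-placeholder-2"}),
    Agent("a3", {"Name C"}),  # not yet linked to any authority
]

for authority, n in count_authority_links(agents).most_common():
    print(f"{authority}: {n}")
```

Agents without any authority link, like `a3` here, would be the candidates for new records in an authority file such as the GND.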
32. Opening the Agents Service
38. Islington Corinthians F.C.:
- Leonard Bradbury
- Jack Braithwaite
- Alec Buchanan
- Pat Clark
- George Dance
- Cyril Longman
- Harry Lowe
- Richard Manning
- Albert (Eddie) Martin
- John Miller
- William Miller
- George Pearce
- Bert Read
- Johnny Sherwood
- Dick Tarrant
- Bill Whittaker
- Ted Wingfield
- J.K. Wright
Source: National Library Board Singapore, NewspaperSG, accessed March 25, 2019, http://eresources.nlb.gov.sg/newspapers/Digitised/Article/straitstimes19371128-1.2.117.
58. Expanding data: towards fulltext
• Manual typing not feasible
• Professional double-keying very expensive
• OCR often unusable
• Document: dense layout, normal segmentation fails
• Image: noisy, secondary copies with stains/scratches
• Characters: special characters (emphasis), handwriting
64. Segmentation - I
• Page segmentation (pattern recognition/computer vision)
• Analyze layout of page, use page-internal structures
• Identify semantic units
• Generate co-ordinates, relate them to items, store in DB
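One possible way to relate page co-ordinates to items is a W3C Web Annotation whose target selects a rectangle on the page image. This is a sketch under assumptions: the `Segment` class, the URIs, and the choice of the `linking` motivation are illustrative and not taken from the ECPO database schema.

```python
import json
from dataclasses import dataclass


@dataclass
class Segment:
    """A rectangular region on a page scan that belongs to one item."""
    page_uri: str   # URI of the page image (hypothetical)
    item_uri: str   # URI of the article/image/advertisement record (hypothetical)
    x: int
    y: int
    width: int
    height: int


def to_web_annotation(seg):
    """Express the segment as a Web Annotation with an xywh media fragment."""
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "type": "Annotation",
        "motivation": "linking",
        "body": seg.item_uri,
        "target": {
            "source": seg.page_uri,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={seg.x},{seg.y},{seg.width},{seg.height}",
            },
        },
    }


segment = Segment(
    page_uri="https://example.org/pages/jingbao_1919_04_03_p1",
    item_uri="https://example.org/items/12345",
    x=120, y=80, width=640, height=910,
)
annotation = to_web_annotation(segment)
print(json.dumps(annotation, indent=2))
```

The resulting JSON can be stored as-is or flattened into database columns; either way the box stays tied to both the page and the item it describes.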
65. Segmentation - II
• Page segmentation (crowdsourcing)
• Pilot project with Pallas Ludens GmbH
• Let the crowd help analyze the pages
• Identify and label four item types:
− image/drawing
− article
− advertisement
− additional information
• Supervised
• Non-Chinese speaking community!
68. Outcome of segmentation pilot
1. Page segmentation can be outsourced to expert crowd
• Requires supervision
• Advanced user interfaces (high usability, efficiency)
• Crowd should read Chinese (semantic grouping)
2. Jingbao 晶報 1919-21 completely segmented with qualified
boxes, issues of April 1919 with semantic units
3. Further processing:
• Partnership with Computational Knowledge Lab (知識計算實驗室),
Department of Engineering Science and Ocean Engineering,
National Taiwan University, http://www.cklab.org/
• Seeking additional partners for collaboration!
73. Mark-up: Spaces between some characters
<space unit="chars" n="1"/>
OR
<gap unit="char" extent="1"> </gap>
(with “ ” being U+3000)
OR
just use U+3000 without markup
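To make the first option concrete, the sketch below converts runs of U+3000 in a transcribed line into `<space>` elements using Python's standard `xml.etree.ElementTree`. It mirrors the attributes shown above (`unit="chars"` and `n`); the wrapping `<p>` element, the function name, and the sample string are assumptions. TEI also defines a `quantity` attribute for measurements, so the attribute name is kept as a parameter.

```python
import re
import xml.etree.ElementTree as ET

IDEOGRAPHIC_SPACE = "\u3000"


def encode_spaces(text, count_attr="n"):
    """Return a <p> element in which each run of U+3000 becomes <space/>."""
    p = ET.Element("p")
    last = None   # most recently created <space> element
    pos = 0
    for match in re.finditer(IDEOGRAPHIC_SPACE + "+", text):
        chunk = text[pos:match.start()]
        if last is None:
            p.text = chunk       # text before the first <space>
        else:
            last.tail = chunk    # text between two <space> elements
        attrs = {"unit": "chars", count_attr: str(len(match.group()))}
        last = ET.SubElement(p, "space", attrs)
        pos = match.end()
    rest = text[pos:]
    if last is None:
        p.text = rest            # no ideographic space in the line
    else:
        last.tail = rest
    return p


# Hypothetical transcription with one single and one double space.
line = "晶報\u3000第一號\u3000\u3000完"
element = encode_spaces(line)
print(ET.tostring(element, encoding="unicode"))
```

Keeping the count in an attribute makes the width of each gap explicit, which the plain-U+3000 option leaves implicit in the character data.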
76. From data silo towards open data
• Data collection = research data
• Enhance metadata
• Publishing information, content analysis (keywords)
• Separation of meta-/data from user interface
• FAIR principles
• DOI records for publications (in progress), connect database
to library catalogs
• Publish material and metadata Open Access: images,
publication metadata, and item metadata (article, image, ad)
• Basic data API (MODS XML);
open up IIIF manifests and agents data (planned)
• Publish metadata on heiDATA/Dataverse (Summer)
Arnold and Hessel | ECPO Database
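A client consuming a MODS XML data API would typically parse the records with namespace-aware lookups. The sketch below works on an inline sample record rather than a live response: the record content is invented for illustration, and the exact shape of what the ECPO API returns (single `mods` record or `modsCollection`) is an assumption, so both are handled.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical record; values are placeholders, not ECPO data.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example article title</title></titleInfo>
  <name type="personal">
    <namePart>Example Author</namePart>
    <role><roleTerm type="text">author</roleTerm></role>
  </name>
  <originInfo><dateIssued>1919-04-03</dateIssued></originInfo>
</mods>"""


def summarize_mods(xml_text):
    """Extract title, names and issue date from each MODS record."""
    root = ET.fromstring(xml_text)
    # Accept either a single <mods> record or a <modsCollection> wrapper.
    if root.tag == "{http://www.loc.gov/mods/v3}mods":
        records = [root]
    else:
        records = root.findall("mods:mods", MODS_NS)
    summaries = []
    for rec in records:
        summaries.append({
            "title": rec.findtext("mods:titleInfo/mods:title",
                                  default="", namespaces=MODS_NS),
            "names": [n.findtext("mods:namePart", default="",
                                 namespaces=MODS_NS)
                      for n in rec.findall("mods:name", MODS_NS)],
            "date": rec.findtext("mods:originInfo/mods:dateIssued",
                                 default="", namespaces=MODS_NS),
        })
    return summaries


summaries = summarize_mods(SAMPLE)
print(summaries)
```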
77. Wrap-up
• Provide different ways to access data via frontend:
• Search (all metadata and annotations)
• Browse chronologically (calendar)
• Browse/search agents / keywords
• Categories of publications
• Agents service (biographic data)
• cross-db record curation, connect persons with authorities
• plan (2019): add missing agents or names to GND, pull additional
data from authorities, develop agents API
• Page segmentation – crowdsourcing possible, grouping
requires Chinese, new tool creates web-annotations – seeking
partner for automatic page analysis
• Text – plan: process segments, generate full text, store TEI
XML, crowd-based editing
78. ECPO in a larger context
• Content expansion
• Early western publications printed in China
• Co-operation with Univ. Erlangen: Agents
• ECPO as data platform
• for storing, enhancing, accessing, sharing "grey"
material from the CATS Library
• Outreach / communities
• DHd working group Newspaper/Journals, OCR-D,
Transkribus/READ
• Connect with FID Asien (CrossAsia), Non-Latin scripts
interest group, TEI East Asia SIG
• Long-term repository: University Library, heiDATA/heidICON
79. Contact
Matthias Arnold – Lena Hessel
Heidelberg Centre for Transcultural Studies | HCTS
Karl Jaspers Centre
Voßstr. 2 | Building 4400 | Room 005b
69115 Heidelberg, Germany
Phone: +49 - 6221 - 54 4094
eMail: matthias.arnold@uni-hd.de
Web: http://tinyurl.com/matthias-arnold