Working digitally with Historical Documents
Georg Vogeler
@gvogeler
http://www.i-d-e.de
http://informationsmodellierung.uni-graz.at
Napoli, 25.9.2018
Bridge the distance between modern use and
historical production
„Digital Scholarly Edition“ „Historical Analysis“
People in the Past and their Activities
Humanities Scholars, in particular Historians
[Diagram linking the archival document to scholarly use]
• Resources: archival document; digital images; texts; scholarly edition; index, regesta; lists, databases, spreadsheets, reference works, ...
• Formats and standards: Word, PDF, HTML, SVG, ...; CSV, XLSX, SQL, ...; TEI; EAD / RiC; RDFS, OWL; RDF; CIDOC-CRM, …
• Activities: scan / photographs; OCR/HTR, transcription; description; annotation; data creation; transformation; conceptualisation; data analysis; presentation; interpretation
Metasource (J.-Ph. Genet 1994)
Bridge the distance between modern use and
historical production
„Digital Scholarly Edition“
• Select object
• Digitise the archival document
• Create full text
• Structure text
• Annotate / enrich with external
knowledge
• Convert text into structured data
„Historical Analysis“
• Modelling research question and
the data needed to answer the
question
• Select data
• Evaluating algorithms / tools to
process data
• Visualise / organise data in a
meaningful way
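The steps "Structure text", "Annotate" and "Convert text into structured data" can be illustrated with a minimal sketch. It assumes the text has been encoded with TEI elements (persName, placeName, date); the sample sentence, the identifiers and the function names are hypothetical and only serve as an illustration, not as the workflow of any particular edition.

```python
import csv
import io
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI-like fragment; text and identifiers are invented.
SAMPLE = f"""<p xmlns="{TEI_NS}">
  <persName ref="#person_1">Anna</persName> sold a vineyard in
  <placeName ref="#place_1">Graz</placeName> on
  <date when="1350-05-01">1 May 1350</date>.
</p>"""

ANNOTATED = {"persName", "placeName", "date"}


def extract_annotations(xml_text):
    """Return one row per annotated element: (type, surface text, reference)."""
    root = ET.fromstring(xml_text)
    rows = []
    for element in root.iter():
        tag = element.tag.split("}")[-1]  # drop the namespace part of the tag
        if tag in ANNOTATED:
            reference = element.get("ref") or element.get("when") or ""
            rows.append((tag, (element.text or "").strip(), reference))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["type", "text", "reference"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_annotations(SAMPLE)
print(to_csv(rows))
```

The resulting table is the kind of structured data that the "Historical Analysis" side can then select from and process.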
Computational Methods are advancing …
Digitisation
Human
• Selection of objects to be
digitized
• Decision on the appropriate
method
• Quality control
Machine
• Pixel representation
• OCR/HTR
• Make the documents available on
the internet
Digitisation: „Confidence“
50% 48%
http://prhlt-kws.prhlt.upv.es/himanis/?q=bavarie&t=50&r=
http://prhlt-kws.prhlt.upv.es/himanis/?q=bavarie&t=48&r=
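A small sketch of why the confidence setting matters: lowering the threshold from 50% to 48% lets more uncertain hits through. The parameter names q, t and r are copied from the URLs above; reading t as the confidence threshold in percent is an assumption based on the slide title. The hit list is invented and is not real Himanis output.

```python
from urllib.parse import urlencode

BASE_URL = "http://prhlt-kws.prhlt.upv.es/himanis/"


def search_url(query, threshold):
    """Build a search URL in the shape shown on the slide."""
    # Assumption: t is the confidence threshold in percent; r is left empty.
    return BASE_URL + "?" + urlencode({"q": query, "t": threshold, "r": ""})


# Hypothetical hits: (document id, confidence in percent).
HITS = [
    ("doc_a", 93.0),
    ("doc_b", 51.5),
    ("doc_c", 49.2),
    ("doc_d", 48.1),
    ("doc_e", 30.0),
]


def filter_hits(hits, threshold):
    """Keep only the hits whose confidence reaches the threshold."""
    return [doc for doc, confidence in hits if confidence >= threshold]


for threshold in (50, 48):
    print(search_url("bavarie", threshold), filter_hits(HITS, threshold))
```

Which threshold is appropriate is a human decision: it belongs to "Decision on the appropriate method" and "Quality control", not to the machine.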
Digitisation
Human
• Selection of objects to be
digitized
• Decision on the appropriate
method
• Quality control
• Integrate into scholarly
discourse
Machine
• Pixel representation
• Suggestions for layout
• Suggestion for transcriptions (by
training with human
transcriptions)
• Publish
Information Extraction?
Human Machine
• Named Entity Recognition
• „Topic Modelling“
Screenshot from the ChartEx annotation tool
ChartEx Annotation process (Brat)
Information Extraction
Human
• „semantic“ annotation
• „If you have my name, you still don't
know me.“
• Manual annotation
• Identifying (Imported / exported)
• Classification schemes
• Integrate into scholarly discourse
Machine
• Named Entity Recognition
  • In modern texts
  • Linguistic method
• „Topic Modelling“
  • groups of words typical for a
  specific text chunk
  • Linguistic “surface”
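The contrast between the two columns can be sketched in a few lines: the machine finds names on the linguistic surface, while identifying who is meant requires the editor's knowledge. The capitalisation heuristic below is a deliberately naive stand-in, not a real Named Entity Recognition model, and the sentence and all identifiers are hypothetical.

```python
import re

# Hypothetical sentence; two different people share the same name.
TEXT = "Heinrich sold the mill to Heinrich, son of Konrad."


def recognise_names(text):
    """Machine step: capitalised tokens with their character offsets."""
    pattern = r"\b[A-Z][a-z]+\b"
    return [(m.group(), m.start()) for m in re.finditer(pattern, text)]


# Human step: the editor decides which person each mention refers to.
# Keys are character offsets of the mentions; values are invented identifiers.
EDITOR_IDENTIFICATION = {
    0: "person:heinrich_the_miller",
    26: "person:heinrich_son_of_konrad",
    43: "person:konrad",
}


def identify(mentions, identification):
    """Attach the editor's identifier to each recognised mention."""
    return [
        (name, identification.get(offset, "unidentified"))
        for name, offset in mentions
    ]


mentions = recognise_names(TEXT)
print(mentions)
print(identify(mentions, EDITOR_IDENTIFICATION))
```

Both mentions of "Heinrich" look identical on the surface; only the identification step tells them apart.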
The Human in the Loop
Digitisation
• Sensory representation
• Algorithmic conversion
• On the „linguistic surface“
Digital Edition
• Reflecting on the text production
and transmission
• Enrichment with human knowledge
• As part of scholarly discourse
The assertive edition …
… is a scholarly edition which includes a formal representation of the
assertions on the historical reality made by a document in the
interpretation of the editor.
• Assertion: a proposition / statement
• Historical reality: what scholars think that people in the past did and suffered
• Made by a document: a physical object carrying text as a means of
communication (made in the past)
• Interpretation of the editor: as only the editors are part of the current
scholarly discourse
• Formal representation: RDF triples linked to a digital representation of the
document
Vogeler 2018
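A minimal sketch of what "RDF triples linked to a digital representation of the document" could look like. It uses the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object) as one possible way to attach an assertion to its source; this modelling choice is an assumption and not necessarily the one proposed in Vogeler 2018. The example.org namespace, the properties madeByDocument and interpretedBy, and all identifiers are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace for the edition


def iri(value):
    """Format an IRI in N-Triples syntax."""
    return f"<{value}>"


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def assertion_triples(assertion, document, subject, predicate, obj, editor):
    """Describe one assertion, the document making it and the editor reading it."""
    node = iri(EX + assertion)
    return [
        (node, iri(RDF + "type"), iri(RDF + "Statement")),
        (node, iri(RDF + "subject"), iri(EX + subject)),
        (node, iri(RDF + "predicate"), iri(EX + predicate)),
        (node, iri(RDF + "object"), iri(EX + obj)),
        (node, iri(EX + "madeByDocument"), iri(EX + document)),
        (node, iri(EX + "interpretedBy"), literal(editor)),
    ]


def to_ntriples(triples):
    """Serialise the triples line by line in N-Triples syntax."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = assertion_triples(
    assertion="assertion_1",
    document="charter_1",
    subject="person_1",
    predicate="sold",
    obj="vineyard_1",
    editor="Hypothetical Editor",
)
print(to_ntriples(triples))
```

Keeping the document and the editor on the assertion itself makes explicit that the statement is the document's claim in the editor's interpretation, not a bare fact.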
Humans!
Feed the machine
and you will get great insights.
?
Humans!
Integrate the machine into your discourse
and you will get great insights.
Georg Vogeler
georg.vogeler@uni-graz.at
http://www.i-d-e.de
http://informationsmodellierung.uni-graz.at
References
• Himanis: http://himanis.org/
• ChartEx: https://chartex.org/
• Vogeler, Georg (2018). “The ‘assertive edition’”. In: International
Journal of Digital Humanities 1. Forthcoming.
This work is licensed under a Creative Commons Attribution 4.0
International License.
All works by other authors cited here are their intellectual property and
are used for academic teaching purposes only.
