Generale Missieven
// clariah-wp6 use case 1
2020-11-17 DANS research meeting

Dirk Roorda

dirk.roorda@dans.knaw.nl
Generale Missieven
• yearly letters from the governor-general
and council of the Dutch East
India Company to its board of
directors in the Netherlands (Heren XVII)

• 1610-1761

• 13 volumes

• 565 letters

• 10,000 pages
resources.huygens.knaw.nl/vocgeneralemissiven
Inside
[Annotated page scan; labelled regions: page header, letter head, provenance, modern editor, transcribed original text, footnotes by modern editor]
resources.huygens.knaw.nl/retroboeken
1960 I Coolhaas
1985 VIII Coolhaas†
1988 IX Van Goor
2004 X Van Goor
1997 XI Schooneveld-Oosterling
2007 XII Schooneveld-Oosterling
2007 XIII s'Jacob
Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
It is a toolkit / model / framework / ethos to

1. get corpus data into RAM

2. compute with it efficiently

3. harvest results

4. recycle results back to the corpus
and to do this in a way that

1. is reproducible

2. reduces friction
that's a long and winding road
Source: TEI
• it's OK for automatic processing,
very discouraging for manual checking and double checking

[Screenshot of a TEI source file, annotated: page number, very long lines, inhuman file names]
Laundry - trim0
• some pages are hopeless

• we re-sourced data from the OCR strings of the
Huygens website

• cases:

• letters without original content are absent from the TEI
(although they do have editorial content and metadata)

• pages with big tables (landscape) resulted in
pathological TEI
Humane data!
• file names are page numbers

• metadata is flattened

• much of the XML overhead is gone

• line breaks are reflected in the layout

All the inherent
problems in this
dataset are still there.

But now we have
hope to see them,

to tackle them.
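The "file names are page numbers" idea can be sketched as follows. This is only an illustration under assumptions: it supposes that page breaks appear in the source as TEI `<pb n="..."/>` elements, and the file name pattern `p0001.xml` is made up for the example; the actual conversion in the project may work differently.

```python
import re
from typing import Dict

# Assumption: page breaks are marked as <pb n="123"/> in one long source string.
PB_RE = re.compile(r'<pb\s+n="(\d+)"\s*/>')


def split_pages(volume_text: str) -> Dict[str, str]:
    """Map a page-number file name (hypothetical pattern) to the text of that page."""
    # With one capture group, re.split yields:
    # [text before first pb, n1, text1, n2, text2, ...]
    parts = PB_RE.split(volume_text)
    pages: Dict[str, str] = {}
    for num, text in zip(parts[1::2], parts[2::2]):
        pages[f"p{int(num):04d}.xml"] = text.strip() + "\n"
    return pages
```

Zero-padded names sort in page order, which makes manual checking in a file browser easier.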
Laundry - trim1
text separation:

• mark folio references

• correct the markup of page
headers

without this step: 

• loss of original text

• contamination of original text
[Before/after screenshots: vol. 2, p. 538]
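A minimal sketch of marking folio references, so that they can be kept apart from the original text. The shape of a folio reference ("Fol. 12", "Fol. 12v", "Fol. 12-13") and the `<folio>` element name are assumptions made for illustration; the slides do not show the real forms, which have to be established by inspecting the corpus.

```python
import re
from typing import List, Tuple

# Hypothetical shape of a folio reference: "Fol. 12", "Fol. 12v", "Fol. 12-13".
FOLIO_RE = re.compile(r"\bFol\.\s*(\d+[rv]?(?:-\d+[rv]?)?)\.?")


def mark_folios(line: str) -> str:
    """Wrap every folio reference in a (hypothetical) <folio> element."""
    return FOLIO_RE.sub(lambda m: f"<folio>{m.group(1)}</folio>", line)


def split_folios(line: str) -> Tuple[str, List[str]]:
    """Return the line without folio references, plus the references found."""
    refs = FOLIO_RE.findall(line)
    text = re.sub(r"\s{2,}", " ", FOLIO_RE.sub("", line)).strip()
    return text, refs
```

Marking first and removing later keeps the step reversible: nothing is lost, and the original text is no longer contaminated by the references.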
Laundry - trim2
• metadata

• re-distil from
letter headings

• check

• diagnostics
[Before/after screenshots]
Laundry - trim3 - the mother of all laundries
• get the editorial remarks under tight control

even when they spread across pages

• detect all 12,000+ footnote bodies correctly (done)

• connect all footnote refs to their bodies (done)
None of this is feasible without successful completion of the previous steps.
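Connecting footnote references to their bodies can be sketched per page as below. The assumptions are loud ones: footnote marks are taken to look like a small number followed by ")" (for example "9)"), and text lines and footnote lines are assumed to be already separated by the earlier steps. The function name and the diagnostics it returns are illustrative, not the project's own code.

```python
import re
from typing import Dict, List, Tuple

# Assumed mark shape: one or two digits followed by ")" such as "9)".
BODY_RE = re.compile(r"^(\d{1,2})\)\s+(.*)$")
# A reference must not be glued to a preceding digit or dot (e.g. "929)").
REF_RE = re.compile(r"(?<![\d.])(\d{1,2})\)")


def link_footnotes(
    text_lines: List[str], note_lines: List[str]
) -> Tuple[Dict[str, str], List[str], List[str]]:
    """Link footnote references on one page to the bodies on that page.

    Returns (linked, dangling_refs, unused_bodies), so that problems show up
    as diagnostics instead of being silently dropped.
    """
    bodies: Dict[str, str] = {}
    current = None
    for line in note_lines:
        m = BODY_RE.match(line)
        if m:
            current = m.group(1)
            bodies[current] = m.group(2)
        elif current is not None:
            # Continuation line of the current footnote body.
            bodies[current] += " " + line.strip()

    refs = [m.group(1) for line in text_lines for m in REF_RE.finditer(line)]
    linked = {mark: bodies[mark] for mark in refs if mark in bodies}
    dangling = [mark for mark in refs if mark not in bodies]
    unused = [mark for mark in bodies if mark not in refs]
    return linked, dangling, unused
```

The lookbehind is the interesting design choice: in a corpus full of amounts, a digit followed by ")" is not always a footnote mark, so candidates that are part of a longer number are rejected.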
745.3 92 9)
or
( ... retourschepen 745. 3. 929 )
running trim3
in progress
finally
corrections, corrections
End of laundry
github.com/Dans-labs/clariah-gm/xml/
Centrifuge
With clean XML in hand, we centrifuge
the XML out of the clean laundry:

• we squeeze out all tag material
(moisture)

• leaving only pure content (dry clothes)

• ready to process (ready to wear)

• Result:

• clean, dry stuff: Text-Fabric
github.com/Dans-labs/clariah-gm/tf/
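The centrifuge idea, separating tag material from content, can be illustrated with the standard library alone. This is not the Text-Fabric conversion itself: the function name, the element names in the usage example, and the choice to record 1-based word positions per element name are assumptions for the sake of a small, self-contained sketch.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def centrifuge(xml: str) -> Tuple[List[str], Dict[str, List[int]]]:
    """Separate tag material from content.

    Returns the words in document order, plus, for each element name, the
    1-based positions of the words that occur inside such an element.
    """
    words: List[str] = []
    features: Dict[str, List[int]] = {}

    def add(text, stack):
        for word in (text or "").split():
            words.append(word)
            for tag in stack:
                features.setdefault(tag, []).append(len(words))

    def walk(elem, stack):
        inside = stack + [elem.tag]
        add(elem.text, inside)
        for child in elem:
            walk(child, inside)
            # The tail follows the child's end tag, so it belongs to elem.
            add(child.tail, inside)

    walk(ET.fromstring(xml), [])
    return words, features


page = "<page><remark>editor says</remark> original text <folio>12v</folio> more</page>"
words, features = centrifuge(page)
```

After this step the content is a plain sequence of words, and what the tags said about them survives as position lists that can be computed with directly.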
The end of the road?
Local browse/search interface
computing interface
tutorial notebooks online (nbviewer)
• start

• move around programmatically

• search
  • get in focus

• compute
  • refine by computing

• export to Excel
  • collect work sheets

• annotate
  • insights are the new data

• share
  • let others collect your data as easily as you
    collected this corpus
annotation/tutorials/missieven
[Screenshots from the tutorial notebooks: search, compute, annotate (brat), share]
the road ends
what does this road mean?
• for researchers?

• for CLARIAH?

• for DANS / eScience Center / Humanities Cluster / HuygensING

researchers
• short road to be completely "hands
on" with their own corpora

• compute in their first programming
language: "XML"

• no technological overhead outside
their computing scope: XML, RDF, PID

• no metadata intricacy

• focus on data according to their own
mental concepts: the data features
TF corpora
CLARIAH
• a unified practice to compute with corpora:

• students of different corpora can share practices

• they can build cookbooks that transcend their
particular corpus

• remember "peculiarity of missives"?

• nearly the same recipe exists for a dozen
corpora

• where is the greater gain:

• sorting out metadata?

• supporting the processing of metadata?
TF corpora
DANS / eScience / HuC / archives
Text-Fabric uses GitHub as data-backend!

• GitHub is unique in supporting versioned data check-in / check-out

• GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation

YET: 

• GH is optimized for code, not (big) data

• although private repos are possible, GH has little support for access roles

AND

• GH's diffing techniques may be over the top for data
DANS / eScience / HuC / archives
We need another data backend:

• based on the practices of a FAIR repository

• where researchers have the same kind of control as they have in GitHub

• that supports versioning

• where you can download specific versions of specific subfolders of
specific datasets under program control: API
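What "a specific version of a specific subfolder of a specific dataset under program control" looks like can be illustrated with the call that GitHub already offers: its REST API lists the contents of a folder at a given tag, branch or commit via the `ref` query parameter. The sketch below only builds that URL and makes no network request. The helper name is hypothetical, and the version label in the usage line is a made-up example, not a known tag of the repository. A FAIR backend would need an equivalent call.

```python
from urllib.parse import quote, urlencode


def contents_url(owner: str, repo: str, folder: str, version: str) -> str:
    """Build the GitHub REST URL that lists one folder at one tag, branch or commit."""
    path = quote(folder.strip("/"))
    query = urlencode({"ref": version})
    return f"https://api.github.com/repos/{owner}/{repo}/contents/{path}?{query}"


# "v1.0" is a made-up version label, used here for illustration only.
url = contents_url("Dans-labs", "clariah-gm", "tf", "v1.0")
```

The three arguments after the owner map directly onto the three "specifics" in the requirement: dataset, subfolder, version.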
DANS / eScience / HuC / archives
• We need a TextHub, a Data Station for processable, annotated Text

• One corpus has many authors that deliver many parts of the data

• Authors control their own parts and share them from places they "own" on
the Hub

• Users grab those parts from the Hub under program control

• And deliver the new parts they create to the Hub
DANS / eScience / HuC / archives
DANS: provide the Hub (Data Station in Dataverse)

eScience: support best computing practices around the Hub

HuC: consider the Hub as a hop-on to larger infrastructure

Archives: invest in resources on the shelf: make them Hub ready
Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
corpus data into memory

compute

harvest

share & recycle
be reproducible

go smoothly
dirk.roorda@dans.knaw.nl
