Generale Missieven
// clariah-wp6 use case 1
2020-11-17 DANS research meeting

Dirk Roorda

dirk.roorda@dans.knaw.nl
Generale Missieven
• yearly letters from the governor-general
and council of the Dutch East India
Company to the company's directors
in the Netherlands (Heren XVII)

• 1610-1761

• 13 volumes

• 565 letters

• 10,000 pages
resources.huygens.knaw.nl/vocgeneralemissiven
Inside
page header
letter head
provenance
modern editor
transcribed original text
footnotes by modern editor
resources.huygens.knaw.nl/retroboeken
1960 I Coolhaas
1985 VIII Coolhaas†
1988 IX Van Goor
2004 X Van Goor
1997 XI Schooneveld-Oosterling
2007 XII Schooneveld-Oosterling
2007 XIII 's Jacob
Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
It is a toolkit / model / framework /
ethos to

1. get corpus data into RAM

2. compute with it efficiently

3. harvest results

4. recycle results back to the corpus
and to do this in a way that

1. is reproducible

2. reduces friction
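
The four steps above can be sketched in a few lines of plain Python. This is only an illustration of the cycle, not the actual toolkit: the four tokens, the `words` mapping and the `freq_feature` name are invented for the example.

```python
from collections import Counter
import csv
import io

# 1. Get corpus data into RAM. A real corpus would be read from disk;
#    these four tokens are invented for illustration.
words = {1: "de", 2: "retourschepen", 3: "de", 4: "vloot"}

# 2. Compute with it: count how often each word form occurs.
freq = Counter(words.values())


# 3. Harvest results: write them out as tab-separated text.
def harvest(counts):
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "count"])
    for word, n in sorted(counts.items()):
        writer.writerow([word, n])
    return out.getvalue()


# 4. Recycle results back to the corpus: keep the count as a new
#    feature on every token, next to the original data.
freq_feature = {node: freq[word] for node, word in words.items()}

print(harvest(freq))
```

Because every step is code, rerunning the script reproduces the result, and the recycled feature can feed the next round of computing.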
that's a long and winding road
Source: TEI
page number
it's ok for automatic
processing,

very discouraging for
manual checking and
double checking
very long lines
inhuman file names
Laundry - trim0
• some pages are hopeless

• we re-sourced data from the OCR strings of the
Huygens website

• cases:

• letters without original content are missing from the
TEI (although they have editorial content and metadata)

• pages with big tables (landscape) resulted in
pathological TEI
Humane data!
file names are page numbers
metadata is flattened
much of the XML overhead is gone
line breaks are reflected in the layout
All the inherent
problems in this
dataset are still there.

But now we have
hope to see them,

to tackle them.
Laundry - trim1
text separation:

• mark folio references

• correct the markup of page
headers

without this step: 

• loss of original text

• contamination of original text
vol. 2 p 538
before
after
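
A minimal sketch of what marking folio references could look like. The shape of a folio reference ("fol." plus a number and an optional r/v) and the `<folio>` element name are assumptions made for this example; the real corpus and the real trim1 step may differ.

```python
import re

# Assumed shape of a folio reference: "fol." followed by a number and an
# optional r/v side marker. The real corpus may use other forms.
FOLIO_RE = re.compile(r"\b[Ff]ol\.\s*\d+[rv]?")


def mark_folios(line):
    """Wrap every folio reference in a (hypothetical) <folio> element."""
    return FOLIO_RE.sub(lambda m: f"<folio>{m.group(0)}</folio>", line)


print(mark_folios("Batavia, fol. 12v. De schepen"))
```

Once the references are marked explicitly, later steps can separate them from the original text instead of losing or contaminating it.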
Laundry - trim2
• metadata

• re-distil from
letter headings

• check

• diagnostics
before
after
Laundry - trim3 - the mother of all laundries
• get the editorial remarks under tight control

even when they spread across pages

• detect all 12,000+ footnote bodies correctly (done)

• connect all footnote refs to their bodies (done)
None of this is feasible without successful completion of the previous steps.
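
Connecting references to bodies can be pictured as a join on page and footnote mark. The sketch below assumes both have already been detected and simplifies by treating each footnote as living on one page; the function name and the tuple layout are hypothetical.

```python
def link_footnotes(refs, bodies):
    """Pair each footnote reference with the body that has the same mark
    on the same page.

    refs:   list of (page, mark) tuples, in text order
    bodies: list of (page, mark, text) tuples
    Returns (links, dangling_refs, orphan_bodies).
    """
    body_index = {(page, mark): text for page, mark, text in bodies}
    links, dangling, used = [], [], set()
    for ref in refs:
        if ref in body_index:
            links.append((ref, body_index[ref]))
            used.add(ref)
        else:
            # A reference without a body: report it for manual checking.
            dangling.append(ref)
    # Bodies that no reference points to are reported as well.
    orphans = [key for key in body_index if key not in used]
    return links, dangling, orphans


refs = [(538, "1"), (538, "2"), (539, "1")]
bodies = [(538, "1", "Zie p. 12."), (539, "1", "Niet gevonden."), (539, "2", "Idem.")]
links, dangling, orphans = link_footnotes(refs, bodies)
print(links, dangling, orphans)
```

The two leftover lists are the useful part in practice: they point at the pages where detection or OCR went wrong.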
745.3 92 9)
( ... retourschepen 745. 3. 929 )
or
running trim3
in progress
finally
corrections, corrections
End of laundry
github.com/Dans-labs/clariah-gm/xml/
Centrifuge
• Result:

• clean, dry stuff: Text-Fabric
github.com/Dans-labs/clariah-gm/tf/
With clean XML in hand, we centrifuge
the XML out of the clean laundry:

• we squeeze out all tag material
(moisture)

• leaving only pure content (dry clothes)

• ready to process (ready to wear)
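
The centrifuge idea can be sketched with the standard library: walk the XML, keep the words, and let the enclosing tag survive only as a feature value on each word. The element names `para` and `note` are invented for the example, and this is not the actual conversion code of the project.

```python
import xml.etree.ElementTree as ET


def centrifuge(xml_text):
    """Turn XML into a list of (word, kind) pairs, dropping all tags.

    kind is the tag name of the nearest enclosing element, so the
    information carried by the markup survives as a feature value.
    """
    tokens = []

    def walk(elem):
        for word in (elem.text or "").split():
            tokens.append((word, elem.tag))
        for child in elem:
            walk(child)
            # Text after a child's closing tag belongs to the current element.
            for word in (child.tail or "").split():
                tokens.append((word, elem.tag))

    walk(ET.fromstring(xml_text))
    return tokens


print(centrifuge("<para>De vloot <note>editor</note> vertrok</para>"))
```

After this step there is no markup left to parse: only words in reading order, each with a value saying whether it is original text or editorial material.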
The end of the road?
local browse/search interface
computing interface
tutorial notebooks online
nbviewer
• start 

• move around programmatically

• search
• get in focus

• compute
• refine by computing

• exportExcel
• collect work sheets

• annotate
• insights are the new data

• share
• let others collect your data as easily as you
collected this corpus
annotation/tutorials/missieven
search
compute
annotate (brat)
share
the road ends
what does this road mean?
• for researchers?

• for CLARIAH?

• for DANS / eScience Center / Humanities Cluster / HuygensING?

researchers
• short road to be completely "hands
on" with their own corpora

• compute in their first programming
language: "XML"

• no technological overhead outside
their computing scope: XML, RDF, PID

• no metadata intricacy

• focus on data according to their own
mental concepts: the data features
TF corpora
CLARIAH
• a unified practice to compute with corpora:

• students of different corpora can share practices

• they can build cookbooks that transcend their
particular corpus

• remember "peculiarity of missives"?

• nearly the same recipe exists for a dozen
corpora

• where is the greater gain:

• sorting out metadata?

• supporting the processing of metadata?
TF corpora
DANS / eScience / HuC / archives
Text-Fabric uses GitHub as data-backend!

• GitHub is unique in supporting versioned data check-in / check-out

• GitHub is a hub toward top-notch preservation services: Zenodo, Software Heritage Foundation

YET: 

• GH is optimized for code, not (big) data

• although private repos are possible, GH has little support for access roles

AND

• GH's diffing techniques may be over the top for data
DANS / eScience / HuC / archives
We need another data backend:

• based on the practices of a FAIR repository

• where researchers have the same kind of control as they have in GitHub

• that supports versioning

• where you can download specific versions of specific subfolders of
specific datasets under program control: API
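
To make the last requirement concrete, here is a sketch of how a program might address one version of one subfolder of one dataset. Everything in it is hypothetical: the host `texthub.example.org`, the URL layout and the `download_url` function are invented for illustration and do not describe an existing service.

```python
from urllib.parse import quote, urlencode

HUB = "https://texthub.example.org/api"  # hypothetical endpoint


def download_url(dataset, version, subfolder):
    """Build the address of one subfolder in one version of one dataset."""
    path = "/".join(quote(part, safe="") for part in (dataset, version))
    query = urlencode({"path": subfolder})
    return f"{HUB}/datasets/{path}/download?{query}"


print(download_url("clariah-gm", "0.4", "tf/core"))
```

The point of the sketch is that dataset, version and subfolder are all explicit parameters, so a notebook can pin exactly the data it was run on.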
DANS / eScience / HuC / archives
• We need a TextHub, a Data Station for processable, annotated Text

• One corpus has many authors that deliver many parts of the data

• Authors control their own parts and share them from places they "own" on
the Hub

• Users grab those parts from the Hub under program control

• And deliver the new parts they create to the Hub
DANS / eScience / HuC / archives
DANS: provide the Hub (Data Station in Dataverse)

eScience: support best computing practices around the Hub

HuC: consider the Hub as a hop-on to larger infrastructure

Archives: invest in resources on the shelf: make them Hub ready
Computing Companion
helps a student of a corpus to approach it with her computer-aided intelligence
corpus data into memory

compute

harvest

share & recycle
be reproducible

go smoothly
dirk.roorda@dans.knaw.nl
