This document discusses text research and the Text-Fabric data model. It describes Text-Fabric as a data model for annotated text corpora, a query engine, a text weaver, and an API. The data model transforms TEI-XML into separate feature files to untangle annotations and enable better data logistics. Computational research involves gathering data from repositories, modeling and analyzing it, publishing results back to repositories, and discussing conclusions in notebooks. Publishing workflows include building websites that deliver research outputs to the general public more accessibly.
Dependency Parsing-based QA System for RDF and SPARQL (Fariz Darari)
This document describes a dependency parsing-based question answering system that uses RDF and SPARQL. It parses natural language facts and questions into typed dependencies, translates them into RDF and SPARQL, queries the populated RDF data to answer questions, and incorporates background knowledge from WordNet and DBpedia. The system handles negation, tenses, and passive voice, and the document provides examples of its question answering capabilities.
Full Video: https://www.youtube.com/watch?v=cOShsisEsC0
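The dependency-to-RDF translation described above can be sketched with hand-written typed dependencies (a toy illustration of the idea, not the system's actual pipeline; the relation names follow common dependency-grammar conventions):

```python
# Toy sketch: the subject and object of the verb in the typed dependencies
# become the subject and object of an RDF-style triple.
def triples_from_deps(deps):
    """deps: list of (relation, head, dependent) typed dependencies."""
    subj = next(dep for rel, head, dep in deps if rel == "nsubj")
    verb = next(head for rel, head, dep in deps if rel == "nsubj")
    obj = next(dep for rel, head, dep in deps if rel == "dobj" and head == verb)
    return [(subj, verb, obj)]

# "Einstein developed relativity" as typed dependencies:
deps = [("nsubj", "developed", "Einstein"), ("dobj", "developed", "relativity")]
facts = triples_from_deps(deps)
print(facts)  # [('Einstein', 'developed', 'relativity')]

# A question like "Who developed relativity?" would then translate into a
# SPARQL-style pattern with the unknown subject as the variable:
query = "SELECT ?who WHERE { ?who :developed :relativity }"
answers = [s for s, p, o in facts if p == "developed" and o == "relativity"]
print(answers)  # ['Einstein']
```

A real system would of course obtain the dependencies from a parser and evaluate the SPARQL against a triple store; the list comprehension above only mimics that pattern match.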
An overview of the relation and combination of three data processing paradigms that are becoming more relevant today. It introduces the essentials of graph, distributed, and stream computing, and beyond. Furthermore, it questions the fundamental problems we want to solve with data analysis, and the potential of eventually saving humankind in the next millennium by improving the state of the art of computation technologies, while we are busy answering first-world-problem questions. Crazy but possible.
Nicolas Pastorino: The Open-source roar in the eZ Community
This document summarizes Nicolas Pastorino's presentation to the eZ Day Paris 2011 conference. It discusses the growth of the open source eZ community in areas like pull requests, new extensions, forums and blogs. Tools that support community contributions are highlighted, including the eZ Publish code repository on GitHub and the issue tracker. Future plans like a community roadmap, event planning tool, and job board are outlined. The presentation emphasizes that an active, growing community benefits eZ Publish and calls for community members to spread awareness of the system.
Python 3.6 was released in December 2016, and it includes 16 new PEPs! In this talk, we will focus on PEP 498 - Literal String Interpolation, affectionately known as f-strings. Let's learn about f-strings. See some examples, and know the gotchas. You’ll want to upgrade to Python 3.6 just for this!
Presented at PyCon AU, August 5th 2017
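A few illustrative PEP 498 snippets in the spirit of the talk (these are my examples, not slides from the presentation):

```python
# f-string basics (Python 3.6+), plus a common gotcha.
name = "world"
value = 4 * 10

greeting = f"Hello, {name}!"   # arbitrary expressions are evaluated in place
padded = f"{value:>6}"         # format specs work just like str.format
nested = f"{value:{'>'}{6}}"   # even the format spec itself can be dynamic

# Gotcha: in Python 3.6, backslashes are not allowed inside the braces.
# f"{'\n'.join(items)}" is a SyntaxError; bind the separator to a name instead.
sep = "\n"
joined = f"{sep.join(['a', 'b'])}"

print(greeting)  # Hello, world!
print(padded)    # '    40'
```

Note that f-strings are evaluated at runtime in the enclosing scope, which is what makes them both convenient and occasionally surprising.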
MementoMap Framework for Flexible and Adaptive Web Archive Profiling (Sawood Alam)
In this work we propose MementoMap, a flexible and adaptive framework to efficiently summarize the holdings of a web archive. We describe a simple yet extensible file format suitable for MementoMaps. We used the complete index of arquivo.pt, comprising 5B mementos (archived web pages/files), to understand the nature and shape of its holdings, and generated MementoMaps with varying amounts of detail from its HTML pages with an HTTP status code of 200 OK. Additionally, we designed a single-pass, memory-efficient, and parallelization-friendly algorithm to compact a large MementoMap into a smaller one, and an in-file binary search method for efficient lookup. We analyzed more than three years of MemGator (a Memento aggregator) logs to understand the response behavior of 14 public web archives, and evaluated MementoMaps by measuring their accuracy using 3.3M unique URIs from those logs. We found that a MementoMap of less than 1.5% relative cost (compared to the comprehensive listing of all unique original URIs) can correctly identify the presence or absence of 60% of the lookup URIs in the corresponding archive while maintaining 100% recall (i.e., zero false negatives).
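The in-file binary search idea can be sketched in Python over a tiny hypothetical MementoMap (the keys and counts below are illustrative, not real arquivo.pt data, and a real MementoMap also supports wildcard/prefix matching, which is omitted here):

```python
import bisect

# A simplified MementoMap: rows of (SURT-style URI key, memento count),
# kept sorted so that membership lookups can binary-search the keys.
rows = sorted([
    ("com,example)/", 120),
    ("com,example)/about", 3),
    ("org,archive)/", 4500),
    ("pt,arquivo)/", 9000),
])
keys = [k for k, _ in rows]

def lookup(surt_key):
    """Binary-search the sorted keys; return the count if present, else 0."""
    i = bisect.bisect_left(keys, surt_key)
    if i < len(keys) and keys[i] == surt_key:
        return rows[i][1]
    return 0

print(lookup("org,archive)/"))  # 4500
print(lookup("net,missing)/"))  # 0
```

The on-disk variant in the paper applies the same idea directly to byte offsets within the file, avoiding the need to load the whole MementoMap into memory.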
This document provides an introduction to NoSQL and Neo4j. It discusses how Neo4j is a graph database that is well-suited for storing connected data. It then demonstrates how to query the graph database using the Cypher query language, which uses a declarative pattern matching approach. Examples of real-world uses of Neo4j by companies are also presented to illustrate how it has been adopted for applications such as social networking, fraud detection, and knowledge graphs.
OpenStack is IaaS provider software written in Python. As such, it provides a massively scalable operating system and services such as Image, Storage, Object, and Compute.
This talk aims to give the audience an overview of OpenStack: its capabilities, modules, coding styles, workflow, and organization.
As a successful case of community-driven development, it is definitely a good reference for anyone willing to take that road or to join existing projects.
20171017 3PL Machine Learning & AI in Transport & Logistics (Frank Salliau)
This document discusses machine learning and AI in transport and logistics. It provides an overview of machine learning concepts like supervised and unsupervised learning, as well as deep learning techniques. It also examines the growth of big data from various sources that enable machine learning applications. Examples are given of machine learning uses in transport and logistics, such as predicting demand and optimizing routes with IoT sensor data.
TypeScript and Flow: Introducing Static Typing to JavaScript Development (Heejong Ahn)
- The document discusses TypeScript and Flow, two type systems for JavaScript. It provides information on their history, design goals around soundness vs productivity, usage statistics, and comparisons of their type systems and tooling support.
- Key differences noted are that Flow focuses on soundness while TypeScript balances correctness and productivity, and that TypeScript has significantly more resources and adoption based on metrics like StackOverflow questions, GitHub stars, and npm downloads.
This document discusses open source software options for startups from the perspective of Victor Neo, a computer science student and CTO. It provides an introduction to Victor and his experience with startups and open source software. The document then discusses the cost savings and importance of people when using software. It provides an overview of common open source office, accounting, graphics, and development applications that startups can use. Examples are given of how Facebook, Twitter, and Victor's own company Pikaland have utilized open source software. The document encourages contributing back to open source communities and lists upcoming talks on related topics.
DataPlotly is a plugin for QGIS that lets you create D3-like plots from spatial data. It is built on top of Plotly, a JavaScript library that offers easy APIs for many languages such as Python, R, MATLAB, and Node.js.
The plugin was created back in 2017 for the upcoming QGIS 3 version: today the plugin has been downloaded more than 50,000 times.
Creating plots is outside the main scope of QGIS, but thanks to its simple Python API it is easy enough to create additional scripts and plugins. Thanks to these APIs, DataPlotly is today a well-maintained Python plugin with a growing community of developers, users, and testers.
DataPlotly plots are completely interactive so that plot elements are directly linked with map items; therefore the user is able to query map items from the main plot canvas.
Thanks to a crowdfunding campaign launched in March 2019 during the annual QGIS User Conference, the functionality of DataPlotly was extended: a complete refactoring of the code, more plot types, and especially the creation of plots in the layout composer.
More and more people are using the plugin to analyze data and to create complex output reports (e.g. during the Covid-19 pandemic).
A call to give back puppetlabs-corosync to the community (Julien Pivotto)
The document discusses Puppet Labs' corosync module, which manages a cluster using Pacemaker, Corosync, and CMAN. It was presented by Julien Pivotto at the 2015 Puppet Contributor Summit. The modules provide Puppet types and resources for configuring primitives, the cluster information base, messaging, and cluster membership and quorum services.
Google processes 400 petabytes of data every month, and that was way back in 2007! With users generating massive amounts of data on social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard disks going down, such data being made available to everyone, and the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
The introduction to my class on machine learning. The subjects covered in this class include:
1.- Linear Classifiers
2.- Non Linear Classifiers
3.- Graphical Models
4.- Clustering
5.- Etc
I am planning to upload the rest once I feel they are at the right level.
File Polyglottery; or, This Proof of Concept is Also a Picture of Cats (Evan Sultanik)
Evan Sultanik's BSides Philly 2017 talk on File Polyglots. Watch the video, here: https://www.youtube.com/watch?v=fdKPnsWp9ho
A polyglot is a file that can be interpreted as multiple different filetypes depending on how it is parsed. While polyglots serve the noble purpose of being a nifty parlor trick, they also have much more nefarious uses, e.g., hiding malicious printer firmware inside a document that subverts a printer when printed, or a document that displays completely different content depending on which viewer opens it. This talk does a deep dive into the technical details of how to create such special files, using examples from some of the recent issues of the International Journal of PoC||GTFO. Learn how we made a PDF that is also a valid NES ROM that, when emulated, displays the MD5 sum of the PDF. Learn how we created a PDF that is also a valid PostScript document that, when printed to a PostScript printer, produces a completely different document. Oh, and the PostScript also prints your /etc/passwd file, for good measure. Learn how to create a PDF that is also a valid Git repository containing its own LaTeX source code and a copy of itself. And many more!
These are the slides for the COSCUP[1] 2013 hands-on[2], "Learning Python from Data".
It uses examples to show the world of Python. I hope it will help you with learning Python.
[1] COSCUP: http://coscup.org/
[2] COSCUP Hands-on: http://registrano.com/events/coscup-2013-hands-on-mosky
The document discusses the Apriori algorithm for finding frequent itemsets in transactional data. The Apriori algorithm works in two phases:
1. It finds all frequent itemsets of length 1 by scanning the transaction database. It then generates candidate itemsets of length k from frequent itemsets of length k-1, and tests them against the database to determine which are frequent.
2. It uses the found frequent itemsets to generate association rules between items. The confidence and support of each rule is calculated to determine how interesting it is.
The algorithm efficiently finds all rules and itemsets that meet a minimum support threshold by generating candidates in a way that prunes any candidate with an infrequent subset, since such a candidate cannot itself be frequent. This allows the algorithm to avoid counting the vast majority of the exponentially many possible itemsets.
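The candidate-generation phase can be sketched in Python (a toy illustration using an absolute support count, not a tuned implementation; the transaction data is made up):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets (as frozensets) with their support counts.

    transactions: list of sets of items; min_support: absolute count.
    """
    # k = 1: count single items by scanning the database.
    items = {item for t in transactions for item in t}
    frequent = {frozenset([i]): sum(1 for t in transactions if i in t) for i in items}
    frequent = {s: c for s, c in frequent.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into length-k candidates, pruning any
        # candidate that has an infrequent (k-1)-subset.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Count the surviving candidates against the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
freq = apriori(transactions, min_support=2)
print(freq[frozenset({"bread", "milk"})])  # 2
```

The rule-generation phase would then, for each frequent itemset, emit rules whose confidence (support of the whole itemset divided by support of the antecedent) exceeds a chosen threshold.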
Parallel computing in Python: Current state and recent advances (Pierre Glaser)
The document discusses parallel computing in Python. It provides an overview of parallelization libraries in Python like multiprocessing and concurrent.futures. It describes different approaches to parallelization using threads versus processes. Thread-based parallelism allows sharing memory but the Global Interpreter Lock limits parallel execution, while process-based parallelism ensures true parallel execution but requires data copying between processes. It also discusses challenges like serialization and portability across environments. Alternatives like loky provide a more robust process pool executor for data science tasks.
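The thread-versus-process trade-off described above can be sketched with the standard library's concurrent.futures (an illustrative snippet, not taken from the talk):

```python
# Thread pools share memory, but CPU-bound Python code is serialized by the
# GIL; process pools run truly in parallel but must pickle arguments and
# results between workers.
from concurrent.futures import ThreadPoolExecutor

def cpu_bound(n):
    # A stand-in for real work. With threads this cannot run in parallel
    # because of the GIL; I/O-bound tasks, by contrast, would overlap fine.
    return sum(i * i for i in range(n))

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(cpu_bound, [10, 100, 1000]))

print(results)  # [285, 328350, 332833500]
```

Swapping in ProcessPoolExecutor gives true parallelism for CPU-bound work, at the cost of serialization overhead, and on some platforms it requires the submitting code to live under an `if __name__ == "__main__":` guard.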
The UNESCO Internet website is the main tool used to disseminate information about the Organization and its programme of activities. A respected source of information, the UNESCO website is ranked among the top five of UN family websites and receives on average 1.8 million unique visitors (7 million page views) per month.
The Secretariat is located at its Paris headquarters and in 52 field offices around the world, and demands high availability of the website, a mission-critical working tool for the Secretariat and its communities.
In this talk, Chakir Piro (UNESCO) and Olivier Dobberkau (dkd) will give a short overview of the history of TYPO3 usage at www.unesco.org and of how more content is being migrated from an old CMS to TYPO3.
We will introduce the setup involved in deploying a multinational and multilingual website with TYPO3. Further on, we will describe the requirements of such a project, which deals with a large number of stakeholders, communication channels, and international events.
Chakir Piro will describe the role of the department he works in, which filters and aggregates the needs of the different sectors, field offices, and cluster offices of UNESCO.
We will give practical insights into how organizations can adopt a fast track to deliver daily content to their website visitors.
Making NumPy-style and Pandas-style code faster and able to run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for four years. This talk describes how Numba and Dask provide scaled Python today.
The document discusses storing high-energy physics data in the DAOS object storage system. It provides an overview of high-energy physics experiments and data, and the ROOT framework commonly used for data analysis. It then introduces RNTuple, a new data format being developed to replace the legacy TTree format used in ROOT. The document details how RNTuple maps data to DAOS objects and describes a C++ interface for interacting with DAOS. It notes that from the user perspective, working with RNTuple data in DAOS is similar to working with files on disk.
This document discusses OpenStreetMap (OSM) and the characteristics of a "NeoGeographer". Some key points:
- OSM is an open-source map of the world that anyone can edit or use. It aims to provide an alternative to proprietary map data from companies like Google Maps.
- A "NeoGeographer" is someone who contributes geospatial data to digital maps using modern tools, as opposed to traditional "Old Geographers" who worked in the past.
- OSM data is licensed under the Open Database License which allows for reuse and modification of the map data. In contrast, Google Maps does not allow for free secondary use or modification of its map content.
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ... (Data Con LA)
The Apriori Algorithm is an unsupervised learning technique for producing associative rules. This talk will explain the algorithm's implementation, explore how effective it can be when applied to big data, discuss how we use it at DataScience to do market basket analysis, and demonstrate some novel use cases involving the million song database, recipes, and other applications involving open data.
The document is a summary of the topics covered in a CSE 471 class over 15 weeks. It includes a list of chapters covered from the textbook as well as brief summaries of the material covered each week, including search techniques, planning, logical inference, probabilistic reasoning, learning methods, and conclusions about AI. It also notes that the take-home exam will be released this week and provides instructions for completing online course evaluations.
1. The document discusses various Yahoo products and services for startups including YQL for querying data from the web, BOSS for search, and YSlow for improving website speed.
2. It promotes these tools as helping startups with key phases of developing a business from coming up with an idea, building the product, refining it, and achieving profitability.
3. The tools are described as open, free to use, and able to scale easily with millions of users while requiring no purchases from developers.
Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
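The "annotated text as a graph" idea can be sketched minimally (the names below are hypothetical illustrations, not the TextPy or Text-Fabric API):

```python
# Word slots are numbered nodes; higher-level nodes (phrases, sentences)
# link to ranges of slots; annotations live in feature maps keyed by node.
words = ["In", "the", "beginning", "was", "the", "word"]
slots = list(range(len(words)))          # nodes 0..5 are word slots

edges = {10: slots[:4], 11: slots[4:]}   # nodes 10 and 11 are phrase nodes
features = {
    "text": {i: w for i, w in enumerate(words)},
    "type": {10: "clause", 11: "np"},
}

def text_of(node):
    """Concatenate the word slots that a node covers (a slot covers itself)."""
    span = edges.get(node, [node])
    return " ".join(features["text"][s] for s in span)

print(text_of(10))  # "In the beginning was"
print(text_of(11))  # "the word"
```

Keeping each feature in its own map is what makes the untangling efficient: a query touching only one annotation layer never has to load the others, much as Pandas operates column by column.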
20171017 3PL Machine Learning & AI in Transport & LogisticsFrank Salliau
This document discusses machine learning and AI in transport and logistics. It provides an overview of machine learning concepts like supervised and unsupervised learning, as well as deep learning techniques. It also examines the growth of big data from various sources that enable machine learning applications. Examples are given of machine learning uses in transport and logistics, such as predicting demand and optimizing routes with IoT sensor data.
TypeScript와 Flow: 자바스크립트 개발에 정적 타이핑 도입하기Heejong Ahn
- The document discusses TypeScript and Flow, two type systems for JavaScript. It provides information on their history, design goals around soundness vs productivity, usage statistics, and comparisons of their type systems and tooling support.
- Key differences noted are that Flow focuses on soundness while TypeScript balances correctness and productivity, and that TypeScript has significantly more resources and adoption based on metrics like StackOverflow questions, GitHub stars, and npm downloads.
This document discusses open source software options for startups from the perspective of Victor Neo, a computer science student and CTO. It provides an introduction to Victor and his experience with startups and open source software. The document then discusses the cost savings and importance of people when using software. It provides an overview of common open source office, accounting, graphics, and development applications that startups can use. Examples are given of how Facebook, Twitter, and Victor's own company Pikaland have utilized open source software. The document encourages contributing back to open source communities and lists upcoming talks on related topics.
DataPlotly is a plugin for QGIS that allows to create D3 like plots from spatial data. It is build on top of plotly, a javascript library which offers easy API for many languages such as Python, R, Matlab and NodeJS.
The plugin was created back in 2017 for the upcoming QGIS 3 version: today the plugin has been downloaded more than 50,000 times.
Creating plots is out of the main scopes of QGIS but thanks to the simple Python API it is easy enough to create additional scripts and plugins. Thanks to these APIs, DataPlotly is today a well maintained Python plugin with a growing community of developers, users and testers.
DataPlotly plots are completely interactive so that plot elements are directly linked with map items; therefore the user is able to query map items from the main plot canvas.
Thanks to a crowdfunding campaign launched in March 2019 during the annual QGIS User Conference, the functionalities of DataPlotly were extended: a complete refactoring of the code, more plots but especially the creation of plots in the layout composer.
More and more people are using the plugin to analyze the data and to create complex output reports of data (e.g. the Covid-19 pandemic
A call to give back puppetlabs-corosync to the communityJulien Pivotto
The document discusses Puppet Labs' corosync module, which manages a cluster using Pacemaker, Corosync, and CMAN. It was presented by Julien Pivotto at the 2015 Puppet Contributor Summit. The modules provide Puppet types and resources for configuring primitives, the cluster information base, messaging, and cluster membership and quorum services.
Google processes 400 petabytes of data every month and that was way back in 2007! With users generating massive amounts of data in social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard-disks going down, and such data being made available to everyone, and with the advent of cloud computing, we now have the power to process such data ourselves.
What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.
The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.
The introduction to my class on machine learning. The subjects covered in this class go from:
1.- Linear Classifiers
2.- Non Linear Classifiers
3.- Graphical Models
4.- Clustering
5.- Etc
I am planning to upload the rest once I feel they are at the level.
File Polyglottery; or This Proof of Concept is Also a Picture of CatsEvan Sultanik
Evan Sultanik's BSides Philly 2017 talk on File Polyglots. Watch the video, here: https://www.youtube.com/watch?v=fdKPnsWp9ho
A polyglot is a file that can be interpreted as multiple different filetypes depending on how it is parsed. While polyglots serve the noble purpose of being a nifty parlor trick, they also have much more nefarious uses, e.g., hiding malicious printer firmware inside a document that subverts a printer when printed, or a document that displays completely different content depending on which viewer opens it. This talk does a deep dive into the technical details of how to create such special files, using examples from some of the recent issues of the International Journal of PoC||GTFO. Learn how we made a PDF that is also a valid NES ROM that, when emulated, displays the MD5 sum of the PDF. Learn how we created a PDF that is also a valid PostScript document that, when printed to a PostScript printer, produces a completely different document. Oh, and the PostScript also prints your /etc/passwd file, for good measure. Learn how to create a PDF that is also a valid Git repository containing its own LaTeX source code and a copy of itself. And many more!
It is the slides for COSCUP[1] 2013 Hands-on[2], "Learning Python from Data".
It aims for using examples to show the world of Python. Hope it will help you with learning Python.
[1] COSCUP: http://coscup.org/
[2] COSCUP Hands-on: http://registrano.com/events/coscup-2013-hands-on-mosky
The document discusses the Apriori algorithm for finding frequent itemsets in transactional data. The Apriori algorithm works in two phases:
1. It finds all frequent itemsets of length 1 by scanning the transaction database. It then generates candidate itemsets of length k from frequent itemsets of length k-1, and tests them against the database to determine which are frequent.
2. It uses the found frequent itemsets to generate association rules between items. The confidence and support of each rule is calculated to determine how interesting it is.
The algorithm efficiently finds all rules and itemsets that meet a minimum support threshold by generating candidates in a way that prunes subsets that cannot be frequent. This allows
Parallel computing in Python: Current state and recent advancesPierre Glaser
The document discusses parallel computing in Python. It provides an overview of parallelization libraries in Python like multiprocessing and concurrent.futures. It describes different approaches to parallelization using threads versus processes. Thread-based parallelism allows sharing memory but the Global Interpreter Lock limits parallel execution, while process-based parallelism ensures true parallel execution but requires data copying between processes. It also discusses challenges like serialization and portability across environments. Alternatives like loky provide a more robust process pool executor for data science tasks.
The UNESCO Internet website is the main tool used to disseminate information about the Organization and its programme of activities. A respected source of information, the UNESCO website is ranked among the top five of UN family websites and receives on average 1.8 million unique visitors (7 million page views) per month.
The Secretariat is located in its Paris headquarters and in 52 field offices around the world, and demands the high availability of the website, a mission critical working tool for the Secretariat and its communities.
In this Talk Chakir Piro (UNESCO) and Olivier Dobberkau (dkd) will give a short overview of the history of the usage of TYPO3 at www.unesco.org and how we are migrating more content from an old cms to TYPO3.
We will introduce the setup involved to deploy a multinational and multilingual website with TYPO3. Further on we will describe the requirements of such a project dealing with a large amount of stakeholders, communication channels and international events.
Chakir Piro will describe the role the department he works in to filter and aggregate the needs of the different sectors, field and cluster offices of UNESCO.
We will give practical insights on how organizations can adopt a fast track to deliver daily content to its website visitors.
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
The document discusses storing high-energy physics data in the DAOS object storage system. It provides an overview of high-energy physics experiments and data, and the ROOT framework commonly used for data analysis. It then introduces RNTuple, a new data format being developed to replace the legacy TTree format used in ROOT. The document details how RNTuple maps data to DAOS objects and describes a C++ interface for interacting with DAOS. It notes that from the user perspective, working with RNTuple data in DAOS is similar to working with files on disk.
This document discusses OpenStreetMap (OSM) and the characteristics of a "NeoGeographer". Some key points:
- OSM is an open-source map of the world that anyone can edit or use. It aims to provide an alternative to proprietary map data from companies like Google Maps.
- A "NeoGeographer" is someone who contributes geospatial data to digital maps using modern tools, as opposed to traditional "Old Geographers" who worked in the past.
- OSM data is licensed under the Open Database License which allows for reuse and modification of the map data. In contrast, Google Maps does not allow for free secondary use or modification of its map content.
Big Data Day LA 2015 - Applications of the Apriori Algorithm on Open Data by ...Data Con LA
The Apriori Algorithm is an unsupervised learning technique for producing associative rules. This talk will explain the algorithm's implementation, explore how effective it can be when applied to big data, discuss how we use it at DataScience to do market basket analysis, and demonstrate some novel use cases involving the million song database, recipes, and other applications involving open data.
The document is a summary of the topics covered in a CSE 471 class over 15 weeks. It includes a list of chapters covered from the textbook as well as brief summaries of the material covered each week, including search techniques, planning, logical inference, probabilistic reasoning, learning methods, and conclusions about AI. It also notes that the take-home exam will be released this week and provides instructions for completing online course evaluations.
1. The document discusses various Yahoo products and services for startups including YQL for querying data from the web, BOSS for search, and YSlow for improving website speed.
2. It promotes these tools as helping startups with key phases of developing a business from coming up with an idea, building the product, refining it, and achieving profitability.
3. The tools are described as open, free to use, and able to scale easily with millions of users while requiring no purchases from developers.
Towards TextPy, a module for processing text.
If we define annotated text as a graph with additional structure, we can make text processing more efficient, in the same way that Pandas makes processing dataframes more efficient.
We demonstrate how Text-Fabric can handle the display of text and annotations, even when chunks of text are not properly embedded in each other. This demo contains examples from the Hebrew Bible and the Old Babylonian Letters (cuneiform clay tablets).
This document discusses applying data analysis techniques used for ancient corpora to the Quran. It presents Text-Fabric (TF) as a graph database model for storing textual data in plain text files without XML or SQL. TF models text as nodes for words and phrases connected by edge relationships, and stores components like words, phrases, chapters and verses that can be uniquely identified. The document provides an example of a TF dataset containing parsed text from Iain M. Banks' novel "Consider Phlebas".
Researchers in ancient text corpora can take control over their data. We show a way to do so by means of Text-Fabric.
Co-production of Cody Kingham and Dirk Roorda
This document summarizes the history and current state of BHSA (Biblia Hebraica Stuttgartensia Amstelodamensis) tools. It describes early tools like EMDROS and SHEBANQ, as well as more recent projects like Text-Fabric that encode texts in a graph structure with minimal encoding. Text-Fabric files separate each feature of the data into individual files for easy processing and combination. The document outlines Text-Fabric data, sharing, starting with the tool, publishing with it, and available apps and corpora. It promotes Text-Fabric's concepts of transparent, contributor-friendly encodings and provides links to relevant GitHub repositories and tutorials.
Developing a tool for handling text with linguistic annotations. Text-Fabric is meant to support researchers who want to contribute portions of the data, and it weaves the contributions into a meaningful whole. Currently, it is primarily meant for working with the Hebrew Bible, based on the ETCBC (Amsterdam) linguistic database.
Conference presentation for 2016 annual meeting of the Society of Biblical Literature, San Antonio. (https://www.sbl-site.org).
Authors: Janet Dyk (linguistic ideas) and Dirk Roorda (computational implementation).
A verb organizes the elements in a sentence. Different patterns of constituents affect the meaning of a verb in a given context. The potential of a verb to combine with patterns of elements is known as its valence. A single set of questions, organized as a flow chart, selects the relevant building blocks within the context of a verb. The resulting pattern provides a particular significance for the verb in question. Because all contexts are submitted to the same flow chart, similarities and differences between verbs come to light. For example, verbs of movement in their causative formation manifest the same patterns as transitive verbs with an object that gets moved. We apply this approach to the whole Hebrew Bible, using the database of the Eep Talstra Centre for Bible and Computer (ETCBC), which contains the relevant linguistic annotations. This allows us to have a complete listing of all patterns for all verbs. It provides the basis for consistent proposals for the significance of specific patterns occurring with a particular verb. The valence results are made available in SHEBANQ, an online research tool based on the ETCBC database. It presents the basic data, text and linguistic features, together with annotations by researchers. The valence results consist of a set of algorithmically generated annotations which show up between the lines of the text. The algorithm itself and its documentation can be found at https://shebanq.ancient-data.org/tools?goto=valence. By using SHEBANQ we achieve several goals with respect to the scholarly workflow: (1) all our results are openly accessible online, and other researchers may comment on them; (2) all resources needed to reproduce this research are available online and can be downloaded (Open Access).
This document provides an overview of the SHEBANQ project, which provides tools for querying annotated Hebrew text data. It describes the data sources and contributors that have built up the underlying text corpus over many years. It also outlines the steps taken to make this data and related tools more accessible, including developing a website, depositing data in archives, running demonstration projects, and integrating the data and tools into broader research environments through additional projects and publications. The goal has been to facilitate wider use of this linguistic resource and foster more digital humanities and data science work based on its contents.
1. The document discusses layers of annotation for analyzing biblical Hebrew text, including the text itself, linguistic features, manually or automatically generated analyses, and queries for exegetical search.
2. It provides an overview of the Linguistic Annotation Framework (LAF) for representing annotated text and statistics on the annotation of one Hebrew text, with over 800,000 regions and 1.4 million nodes.
3. The document describes tools for querying the annotated text, including the SHEBANQ system and LAF-Fabric API, and the ability to work with the data in various formats like XML, binary files, and R.
20151111 utrecht ver theolbibliothecarissenDirk Roorda
DANS is an institute of the Royal Netherlands Academy of Arts and Sciences and the Netherlands Organization for Scientific Research that promotes permanent access to digital research data. It provides data archiving services including depositing datasets in its online repository EASY, which ensures the data is findable, referable, downloadable, usable, and supports scholarly communication through publication of data papers. DANS also works with research organizations using a front office-back office model to facilitate long-term preservation of research data.
Text as Data: processing the Hebrew BibleDirk Roorda
The merits of stand-off markup (LAF) versus inline markup (TEI) for processing text as data. Ideas applied to work with the Hebrew Bible, resulting in tools for researchers and end-users.
Datamanagement for Research: A Case StudyDirk Roorda
How practices of data sharing can help researchers to produce more science.
Session in the data management course organized by RDNL (Research Data in the Netherlands)
Hebrew Bible as Data: Laboratory, Sharing, LessonsDirk Roorda
The document discusses using the Hebrew Bible as a data source for research. It describes several databases and tools for querying and analyzing the data, including ETCBC, SHEBANQ, and LAF-Fabric. It provides an overview of how the data is created, archived, shared and disseminated through the research data cycle. Examples are given of using LAF-Fabric to count nodes, write plain text, and visualize annotations. The goal is to make the Hebrew Bible and linguistic annotations available as linked open data for various types of researchers.
LAF-Fabric: a tool to process the ETCBC Hebrew Text Database in Linguistic Annotation Framework.
How researchers in theology and linguistics can create workflows to analyse the text of the Hebrew Bible and extract data for visualization. Those workflows can be written in Python, and run conveniently in the IPython Notebook.
Joint work with Martijn Naaijer (VU University).
With the Hebrew Bible encoded in Linguistic Annotation Framework (LAF-ISO), and with a new LAF processing tool, we demonstrate how you can do practical data analysis. The tool, LAF-Fabric, integrates with the ipython notebook approach. Our example here is lexeme cooccurrence analysis of bible books. For now, the road from data to visualization is more important than the exact visualization.
The document describes the Linguistic Annotation Framework (LAF), which is an ISO standard for representing stand-off annotation of language resources. LAF allows for annotating text with linguistic information like part-of-speech tags or named entities in an XML format. Example annotated text corpora using LAF include the Open American National Corpus and a text database of the Hebrew Bible. The document then discusses challenges with existing LAF processors and introduces LAF-Fabric as a new tool that compiles LAF annotations into binary data for faster querying of linguistic features and running Python scripts against the data.
Signatures of wave erosion in Titan’s coastsSérgio Sacani
The shorelines of Titan's hydrocarbon seas trace flooded erosional landforms such as river valleys; however, it is unclear whether coastal erosion has subsequently altered these shorelines. Spacecraft observations and theoretical models suggest that wind may cause waves to form on Titan's seas, potentially driving coastal erosion, but the observational evidence of waves is indirect, and the processes affecting shoreline evolution on Titan remain unknown. No widely accepted framework exists for using shoreline morphology to quantitatively discern coastal erosion mechanisms, even on Earth, where the dominant mechanisms are known. We combine landscape evolution models with measurements of shoreline shape on Earth to characterize how different coastal erosion mechanisms affect shoreline morphology. Applying this framework to Titan, we find that the shorelines of Titan's seas are most consistent with flooded landscapes that subsequently have been eroded by waves, rather than a uniform erosional process or no coastal erosion, particularly if wave growth saturates at fetch lengths of tens of kilometers.
Sexuality - Issues, Attitude and Behaviour - Applied Social Psychology - Psyc...PsychoTech Services
A proprietary approach developed by bringing together the best of learning theories from Psychology, design principles from the world of visualization, and pedagogical methods from over a decade of training experience, that enables you to: Learn better, faster!
Mending Clothing to Support Sustainable Fashion_CIMaR 2024.pdfSelcen Ozturkcan
Ozturkcan, S., Berndt, A., & Angelakis, A. (2024). Mending clothing to support sustainable fashion. Presented at the 31st Annual Conference by the Consortium for International Marketing Research (CIMaR), 10-13 Jun 2024, University of Gävle, Sweden.
Compositions of iron-meteorite parent bodies constrain the structure of the pr...Sérgio Sacani
Magmatic iron-meteorite parent bodies are the earliest planetesimals in the Solar System, and they preserve information about conditions and planet-forming processes in the solar nebula. In this study, we include comprehensive elemental compositions and fractional-crystallization modeling for iron meteorites from the cores of five differentiated asteroids from the inner Solar System. Together with previous results of metallic cores from the outer Solar System, we conclude that asteroidal cores from the outer Solar System have smaller sizes, elevated siderophile-element abundances, and simpler crystallization processes than those from the inner Solar System. These differences are related to the formation locations of the parent asteroids because the solar protoplanetary disk varied in redox conditions, elemental distributions, and dynamics at different heliocentric distances. Using highly siderophile-element data from iron meteorites, we reconstruct the distribution of calcium-aluminum-rich inclusions (CAIs) across the protoplanetary disk within the first million years of Solar-System history. CAIs, the first solids to condense in the Solar System, formed close to the Sun. They were, however, concentrated within the outer disk and depleted within the inner disk. Future models of the structure and evolution of the protoplanetary disk should account for this distribution pattern of CAIs.
14–19. What is (computational) research?
• Gather your data from a repository
• Model it in a logical, abstract, tractable way
• Analyse it by means of a suite of well-chosen tools
• Produce results and deliver them again in a repository
• Discuss conclusions in a Jupyter notebook
• Publish and preserve everything in Zenodo/SHA, and/or on a website
22–25. Text-Fabric is ...
• a data model for text corpora with annotations
• a query engine
• a text weaver
• an API
• a python package: pip install text-fabric
35–39. Data model
• TEI-XML: fine for archiving, difficult for data science
• Text-Fabric: untangle
  • from inline to stand-off
  • from tags to features
  • from hierarchy to spatial relationships
  • from nested elements to tables of numbers
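The untangling step can be sketched in a few lines of Python: words become numbered slots, and each annotation becomes a stand-off feature, a plain mapping kept apart from the text. This is an illustration of the idea only, not actual Text-Fabric code; the names `slots` and `mark` are invented for this sketch, and the damage is approximated at word level.

```python
# Minimal illustration of stand-off features (not real Text-Fabric code).
# Words become numbered slots; each feature is a separate, independent mapping.

words = ["he", "asked", "Ethiopia", "for", "support"]
slots = {i + 1: w for i, w in enumerate(words)}   # slots are 1-based

damaged = {3, 4}   # word-level approximation of the damaged stretch
country = {3}      # "Ethiopia" is a country name

def mark(feature):
    """Render a row of 1s under the slots that carry the feature."""
    return " ".join(("1" if s in feature else " ") * len(slots[s])
                    for s in sorted(slots))

print(" ".join(words))
print(mark(damaged))
print(mark(country))
```

Because each feature lives in its own mapping, annotations that would overlap in inline markup are simply two independent rows here.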
43–49. This model solves problems
he asked Eth<damaged>iopia for</damaged> support
now try to mark Ethiopia as the name of a country:
* he asked <country>Eth<damaged>iopia</country> for</damaged> support
this is invalid XML!
• TEI is good for formulating encoding practices
• XML is bad for modelling the richness of text and annotations
• ... I long for a TEI without XML ...
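The claim that the overlapping markup is ill-formed is easy to verify with the standard library parser (a quick demonstration added here, not part of the original deck):

```python
# <country> opens inside the sentence but closes inside <damaged>, so the
# element boundaries cross. Any conforming XML parser must reject this.
import xml.etree.ElementTree as ET

overlapping = ("<s>he asked <country>Eth<damaged>iopia</country>"
               " for</damaged> support</s>")
try:
    ET.fromstring(overlapping)
    result = "parsed"
except ET.ParseError:
    result = "invalid XML"
print(result)   # → invalid XML
```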
51–58. ... but for now we just take a more abstract model ...
the Text-Fabric solution is:

        | he asked Ethiopia for support
        -------------------------------
damaged |              111111111
country |           11111111

• the data for damaged and country end up in separate files =>
• separation of concerns =>
• better data logistics
59–62. Data logistics
the whole corpus is just a bunch of separate files, each dealing with a well-defined aspect of the data
(140,000 lines)
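As a rough sketch of what such a per-feature file might look like: a few header lines, a blank line, then one node–value line per node. This is simplified for illustration; the real `.tf` format is more compact, and the functions `write_feature` and `read_feature` are invented for this sketch, not the Text-Fabric API.

```python
# Toy per-feature file, loosely inspired by Text-Fabric's .tf layout
# (simplified; not the official spec): "@key=value" header lines, a blank
# line, then "node<TAB>value" lines.
import os
import tempfile

def write_feature(path, meta, data):
    """Write one feature to its own plain-text file."""
    with open(path, "w") as f:
        for key, value in meta.items():
            f.write(f"@{key}={value}\n")
        f.write("\n")
        for node in sorted(data):
            f.write(f"{node}\t{data[node]}\n")

def read_feature(path):
    """Read a feature file back into a node -> value mapping."""
    data = {}
    with open(path) as f:
        lines = f.read().split("\n")
    body = lines[lines.index("") + 1:]        # skip the header block
    for line in body:
        if line:
            node, value = line.split("\t", 1)
            data[int(node)] = value
    return data

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "country.tf")
    write_feature(path, {"valueType": "str"}, {3: "Ethiopia"})
    print(read_feature(path))   # → {3: 'Ethiopia'}
```

Each feature being its own small file is what makes the data logistics easy: features can be versioned, shared, and combined independently.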
84–86. Named Entity Recognition
"record" a plain text from the TF dataset
and remember the original "coordinates", i.e. the nodes
87. CLARIAH/wp6-missieven
• Generale Missieven in TF => plain text with recorded positions
cltl/voc-missives
• Sophie Arnoult (CLTL): plain text => named entities
entities notebook
• named entities + recorded positions => back to TF features
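The round trip above can be sketched as follows. The code and names (`char2node`, `span_to_nodes`) are illustrative, not the actual CLARIAH/wp6-missieven implementation: while serializing the corpus to plain text, remember which node produced each character; afterwards, map the character spans reported by the NER tool back to nodes, which then receive new TF features.

```python
# Hedged sketch of "recording" plain text with node coordinates and mapping
# NER results back to nodes (illustrative, not the project's real code).

words = {1: "he", 2: "asked", 3: "Ethiopia", 4: "for", 5: "support"}

plain_parts, char2node = [], []
for node, w in words.items():
    plain_parts.append(w)
    char2node.extend([node] * len(w))   # each character remembers its node
    char2node.append(None)              # the separator between words
plain = " ".join(plain_parts)

def span_to_nodes(start, end):
    """Map a character span in the recorded plain text back to nodes."""
    return sorted({char2node[i] for i in range(start, end)
                   if char2node[i] is not None})

# Pretend the NER tool reported this span as a location entity:
start = plain.index("Ethiopia")
print(span_to_nodes(start, start + len("Ethiopia")))   # → [3]
```

The returned nodes are exactly where a new feature (say, an entity label) would be attached in the TF dataset.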
88. Reporting
• Jupyter notebooks are excellent to tell a computational story
• or to reason things out on the basis of data
• and to highlight the argument with visualisations
• and they fit nicely in a repo
• and repo releases can be archived and "DOI-ed"
91–92. Publishing workflow
• When delivering data in repos and writing articles in journals is not enough ...
• Build a website
• Infrastructure needed to do this efficiently for many corpora
113. team-text — production street to online
[diagram] source (TEI, PageXML, ASCII database) → pre-process (untangle) → back-end (TextRepo, AnnoRepo) → broker (Broccoli) → front-end (TextAnnoViz) → your web browser
• data science interface: researcher, working from a GitRepo in a Jupyter notebook
• general public interface: user, in a web browser
• can do large corpora, but also small ones
• corpus must fit in RAM; large corpora: by volume
• corpora: Globalise, Republic, Mondriaan, Suriano, Translatin, Hermans
117–120. Final remarks
• work in a repo: start-to-finish
• logic enables logistics
• stand-off annotations keep it clean
• let tools support repo operations
• just one example of how this can be done

pip install text-fabric
github.com/annotation/text-fabric
dirk.roorda@di.knaw.huc.nl
thank you