This document discusses text research and the Text-Fabric data model. It describes Text-Fabric as a data model for annotated text corpora, a query engine, a text weaver, and an API. The data model transforms TEI-XML into separate feature files to untangle annotations and enable better data logistics. Computational research involves gathering data from repositories, modeling and analyzing it, publishing results back to repositories, and discussing conclusions in notebooks. Publishing workflows include building websites to make research outputs more accessible to the general public.

19. What is (computational) research?
• Gather your data from a repository
• Model it in a logical, abstract, tractable way
• Analyse it by means of a suite of well-chosen tools
• Produce results and deliver them again in a repository
• Discuss conclusions in a Jupyter notebook
• Publish and preserve everything in Zenodo/SHA, and/or on a website

25. Text-Fabric is ...
• a data model for text corpora with annotations
• a query engine
• a text weaver
• an API
• a Python package: pip install text-fabric

39. Data model
• TEI-XML: fine for archiving, difficult for data science
• Text-Fabric:
  • from inline to stand-off
  • from tags to features
  • from hierarchy to spatial relationships
  • from nested elements to tables of numbers
untangle

49. This model solves problems
he asked Eth<damaged>iopia for</damaged> support
now try to mark Ethiopia as a name of a country
* he asked <country>Eth<damaged>iopia</country> for</damaged> support
this is invalid XML!
• TEI is good for formulating encoding practices
• XML is bad for modelling the richness of text and annotations
• ... I long for a TEI without XML ...
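
The overlap problem above can be checked mechanically. The following is a minimal sketch using Python's standard XML parser; the wrapping <s> root element and the helper name is_well_formed are assumptions added for illustration and are not part of the slides.

```python
import xml.etree.ElementTree as ET

# The two encodings from the slide, each wrapped in a hypothetical <s> root
# element so that every string is a single XML document.
damaged_only = "<s>he asked Eth<damaged>iopia for</damaged> support</s>"
overlapping = (
    "<s>he asked <country>Eth<damaged>iopia</country> for</damaged> support</s>"
)


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# The first string nests properly; in the second, </country> closes
# while <damaged> is still open, so the parser rejects it.
print(is_well_formed(damaged_only))
print(is_well_formed(overlapping))
```

The second string fails because the country and damaged regions overlap without one containing the other, which inline XML elements cannot express directly.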

58. ... but for now we just take a more abstract model ...
the Text-Fabric solution is:
        | he asked Ethiopia for support
---------------------------------------
damaged |             111111111
country |          11111111
• the data for damaged and country end up in separate files =>
• separation of concerns =>
• better data logistics
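
The stand-off idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: it annotates character positions in a single string, and the names mark, render, and features are hypothetical. It is not the Text-Fabric file format or API.

```python
text = "he asked Ethiopia for support"


def mark(text, fragment):
    """Return {position: 1} for every character of the first match of fragment."""
    start = text.index(fragment)
    return {pos: 1 for pos in range(start, start + len(fragment))}


# Each feature is an independent mapping; neither knows about the other,
# so overlapping regions are not a problem.
features = {
    "damaged": mark(text, "iopia for"),
    "country": mark(text, "Ethiopia"),
}


def render(text, features):
    """Draw the text with one row of marks per feature."""
    width = max(len(name) for name in features)
    lines = [" " * width + " | " + text, "-" * (width + 3 + len(text))]
    for name, positions in features.items():
        row = "".join("1" if i in positions else " " for i in range(len(text)))
        lines.append((name.ljust(width) + " | " + row).rstrip())
    return "\n".join(lines)


print(render(text, features))

# Combining features is a set operation on positions:
# the part of the text that is both damaged and a country name.
both = sorted(set(features["damaged"]) & set(features["country"]))
print(text[both[0] : both[-1] + 1])
```

Because every feature is stored on its own, adding a third annotation layer means adding one more mapping; the text and the existing features stay untouched.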

62. Data logistics
140,000 lines
the whole corpus is just a bunch of separate files, each dealing with a well defined aspect of the data
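
One way to picture "one file per aspect" is the sketch below. The tab-separated layout, the .tsv extension, and the function names save_features and load_feature are assumptions made for illustration; the actual Text-Fabric feature files use their own format.

```python
import tempfile
from pathlib import Path

# Two toy features, each a mapping from position to value (hypothetical data).
features = {
    "damaged": {12: 1, 13: 1},
    "country": {9: 1, 10: 1},
}


def save_features(features, directory):
    """Write every feature to its own file: one 'position<TAB>value' per line."""
    for name, values in features.items():
        lines = [f"{pos}\t{val}" for pos, val in sorted(values.items())]
        path = Path(directory) / f"{name}.tsv"
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def load_feature(directory, name):
    """Read back a single feature without touching any other file."""
    path = Path(directory) / f"{name}.tsv"
    values = {}
    for line in path.read_text(encoding="utf-8").splitlines():
        pos, val = line.split("\t")
        values[int(pos)] = int(val)
    return values


with tempfile.TemporaryDirectory() as tmp:
    save_features(features, tmp)
    names = sorted(p.name for p in Path(tmp).iterdir())
    country = load_feature(tmp, "country")

print(names)
print(country == features["country"])
```

The point of the design is that a researcher who only needs the country feature loads only that file, and a new feature can be distributed as a new file without republishing the rest.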

86. Named Entity Recognition
"record" a plain text from the TF dataset and remember the original "coordinates", i.e. the nodes

87. CLARIAH/wp6-missieven
• Generale Missieven in TF => plain text with recorded positions
cltl/voc-missives
• Sophie Arnoult (CLTL): plain text => named entities
entities notebook
• named entities + recorded positions => back to TF features
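
The round trip can be sketched as follows. Everything here is a hypothetical stand-in: the word nodes, the function names record, fake_ner, and to_feature, and the string-matching "NER" step, which only imitates the shape of what an external tool returns. It does not use the Text-Fabric API or the cltl/voc-missives code.

```python
# Toy corpus: node number -> word (hypothetical data).
words = {1: "he", 2: "asked", 3: "Ethiopia", 4: "for", 5: "support"}


def record(words):
    """Build plain text and remember which node each character range came from."""
    pieces, positions, offset = [], [], 0
    for node in sorted(words):
        word = words[node]
        positions.append((offset, offset + len(word), node))
        pieces.append(word)
        offset += len(word) + 1  # one space between words
    return " ".join(pieces), positions


def fake_ner(text):
    """Stand-in for an external NER tool: returns (start, end, label) spans."""
    name = "Ethiopia"
    start = text.find(name)
    return [(start, start + len(name), "country")] if start >= 0 else []


def to_feature(entities, positions):
    """Map character spans back to nodes, giving a node -> label feature."""
    feature = {}
    for start, end, label in entities:
        for w_start, w_end, node in positions:
            if w_start < end and start < w_end:  # the two ranges overlap
                feature[node] = label
    return feature


text, positions = record(words)
entity_kind = to_feature(fake_ner(text), positions)
print(text)
print(entity_kind)
```

The external tool only ever sees plain text; the recorded positions are what allow its output to be attached to the original nodes as a new feature.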

88. Reporting
• Jupyter notebooks are excellent to tell a computational story
• or to reason things out on the basis of data
• and to highlight the argument with visualisations
• and they fit nicely in a repo
• and repo releases can be archived and "DOI-ed"

92. Publishing workflow
• When delivering data in repos and writing articles in journals is not enough ...
• Build a website
• Infrastructure needed to do this efficiently for many corpora

113. team-text - production street to online (diagram)
• stages: source, pre-process, back-end, broker, front-end
• source: tei-pagexml, ascii-database
• components: untangle, TextRepo, AnnoRepo, Broccoli, TextAnnoViz
• data science interface: GitRepo, researcher, Jupyter notebook
• general public interface: your web browser, user
• can do large corpora, but also small ones
• corpus must fit in RAM; large corpora: by volume
• Globalise, Republic, Mondriaan, Suriano, Translatin, Hermans

120. Final remarks
work in a repo: start-to-finish
logic enables logistics
stand-off annotations keep it clean
let tools support repo operations
just one example of how this can be done
pip install text-fabric
github.com/annotation/text-fabric
dirk.roorda@di.knaw.huc.nl
thank you