text + annotations
represented for
slicing and dicing - collaboration - versioning
Dirk Roorda
SBL San Antonio
2016-11-18
...
<clause id="3" rela="Objc">
  <phrase id="27" type="VP">
    <word id="153" sp="verb" lex="BR&gt;[" phono="bārˈā"
          lex_heb="בָּרָא" trans="B.@R@74&gt;"
          gloss="create" partofspeech="verb"
          trailer=" ">בָּרָ֣א</word>
    <word ...>...</word>
    ...
  </phrase>
  <phrase ...>...</phrase>
  ...
</clause>
...
straightforward XML
- bloated
- hierarchy problems
- inferior tooling (xslt)
- no separation of concerns
- a horrible mess
- even more bloated
+ graph (nodes/edges)
+ LAF-Fabric (python)
+ separation of concerns
~ an awful load of megabytes
LAF-Fabric binary
~ reasonably compact
- a bit opaque
+ column storage
+ python (pickle + gzip)
+ separation of concerns
~ starting to look decent
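To make "column storage" with "pickle + gzip" concrete, here is a minimal sketch. It is not LAF-Fabric's actual on-disk layout: the function names `save_column` and `load_column`, the file name and the sample values are all assumptions made for illustration. The idea is only that each feature is stored as its own compressed column, indexed by node number.

```python
import gzip
import os
import pickle
import tempfile


def save_column(path, values):
    """Write one feature column (a list of values indexed by node) to disk."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(values, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_column(path):
    """Read a feature column back into memory."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)


# Hypothetical sample: part of speech for the first four words.
part_of_speech = ["prep", "subs", "verb", "subs"]

with tempfile.TemporaryDirectory() as folder:
    column_path = os.path.join(folder, "sp.pickle.gz")  # hypothetical file name
    save_column(column_path, part_of_speech)
    restored = load_column(column_path)

print(restored[2])
```

The result is compact but, as the slide says, a bit opaque: the file cannot be read without Python.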
Text Fabric
+ compact
+ transparent
+ column storage
+ pure python
+ separation of concerns
~ with the syntax bureaucracy out of the way,
we can start to look into the real problems:
* slicing and dicing
* collaboration
* versioning
B
R>CJT/
BR>[
>LHJM/
>T
H
CMJM/
W
>T
H
>RY/
W
H
>RY/
HJH[
THW/
W
BHW/
W
XCK/
first 20 words from 200000
200000 BN/
KL/
H
JWM/
B
JWM/
PC<[
>DWM/
MN
TXT/
JD/
JHWDH/
W
MLK[
<L
MLK/
W
<BR[
JWRM/
Y<JR=/
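The two word lists above suggest how a feature file can stay compact: one value per line, numbered implicitly, with an explicit node number (here 200000) resetting the counter. A minimal reader might look as follows. This is a sketch under assumptions: the function name `read_feature` is hypothetical, and a tab is assumed as the separator between an explicit node number and its value.

```python
def read_feature(lines, first_node=0):
    """Map node numbers to values.

    Each line holds one value for the next node. A line of the form
    'number<TAB>value' (assumed separator) resets the node counter.
    """
    values = {}
    node = first_node
    for line in lines:
        line = line.rstrip("\n")
        if "\t" in line:
            number, value = line.split("\t", 1)
            node = int(number)
        else:
            value = line
        values[node] = value
        node += 1
    return values


# A few values taken from the two lists on the slide.
sample = ["B", "R>CJT/", "BR>[", "200000\tBN/", "KL/", "H"]
lex = read_feature(sample)
print(lex[2], lex[200000], lex[200002])
```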
Text Fabric (inside)
LAF-Fabric, with the LAF replaced by Text
Inside TF, the LAF-Fabric-API just works:
# print the passage label and the text of every verse
for n in NN():
    if F.otype.v(n) == 'verse':
        label = T.passage(n)
        text = T.words(
            L.d('word', n),
            fmt='pf',
        )
        print('{}\t{}\n'.format(
            label, text,
        ))
Text Fabric (skeleton)
A TF dataset is a bunch of text files
with at least otype and monads
otype
0-426580 word
426581-514580 clause
514581-605142 clause_atom
605143-858316 phrase
858317-1125831 phrase_atom
1125832-1189401 sentence
1189402-1253740 sentence_atom
1253741-1367532 subphrase
1367533-1367571 book
1367572-1368500 chapter
1368501-1413680 half_verse
1413681-1436893 verse
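Because every object type occupies one contiguous range of node numbers, the type of any node can be found with a binary search. The sketch below uses the first four ranges listed above; the helper names `parse_otype` and `otype_of` are hypothetical, and the ranges are assumed to be sorted and non-overlapping.

```python
from bisect import bisect_right

OTYPE_LINES = """\
0-426580 word
426581-514580 clause
514581-605142 clause_atom
605143-858316 phrase
""".splitlines()


def parse_otype(lines):
    """Return parallel lists: range starts, range ends, type names."""
    starts, ends, names = [], [], []
    for line in lines:
        span, name = line.split()
        first, last = (int(part) for part in span.split("-"))
        starts.append(first)
        ends.append(last)
        names.append(name)
    return starts, ends, names


def otype_of(node, table):
    """Look up the type of a node by binary search over the range starts."""
    starts, ends, names = table
    index = bisect_right(starts, node) - 1
    if index < 0 or node > ends[index]:
        return None  # node falls outside every listed range
    return names[index]


table = parse_otype(OTYPE_LINES)
print(otype_of(153, table), otype_of(605144, table))
```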
monads
1367533 1-28762
1367534 28763-52510
1367572 1-673
1367573 674-1167
1413681 1-11
1413682 12-31
1368501 1-4
1368502 5-11
1125832 1-11
1125833 12-18
1189402 1-11
1189403 12-18
426581 1-11
426582 12-18
514581 1-11
514582 12-18
605143 1-2
605144 3
858317 1-2
858318 3
1253741 5-7
1253742 9-11
This is the
skeleton:
• the positions
• the containment
of all text objects
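Containment follows from the monads alone: one object contains another when its word positions include all of the other's. A minimal sketch, using a few lines from the monads listing; the function names are hypothetical and only simple `first-last` or single-position specifications are handled.

```python
def parse_range(spec):
    """Turn '1-11' or '3' into a set of word positions."""
    first, _, last = spec.partition("-")
    return set(range(int(first), int(last or first) + 1))


def parse_monads(lines):
    """Map each node to the set of word positions it occupies."""
    pairs = (line.split() for line in lines)
    return {int(node): parse_range(spec) for node, spec in pairs}


def contained_in(outer, monads):
    """Nodes whose words all lie inside the words of `outer`."""
    span = monads[outer]
    return sorted(
        node for node, words in monads.items()
        if node != outer and words <= span
    )


monads = parse_monads(["426581 1-11", "605143 1-2", "605144 3", "426582 12-18"])
print(contained_in(426581, monads))
```

Here clause 426581 turns out to contain phrases 605143 and 605144, but not the next clause 426582.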
text_full
בְּ
רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ
שָּׁמַ֖יִם
וְ
אֵ֥ת
הָ
אָֽרֶץ
וְ
הָ
אָ֗רֶץ
הָיְתָ֥ה
תֹ֨הוּ֙
וָ
בֹ֔הוּ
וְ
חֹ֖שֶׁךְ
parent
0-1,858317 605143
2,858318 605144
3,858319 605145
4-10,858320 605146
426581,1189402 1125832
514581,605143-605146 426581
information content
trailer
_
_
_
_
_
_
׃_
_
_
_
_
_
p-s-p
prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
conj
art
subs
verb
subs
conj
subs
conj
subs
Text Fabric (flesh)
parent
0-1,858317 605143
2,858318 605144
3,858319 605145
4-10,858320 605146
514581,605143-605146 426581
426581,1189402 1125832
Edges (cont'd)
0 1 2 3 4 5 6 7 8 9 10 word
858317 858318 858319 858320 phrase_atom
bᵊrēšˌîṯ bārˈā ʔᵉlōhˈîm ʔˌēṯ haššāmˌayim wᵊʔˌēṯ hāʔˈāreṣ
605143 605144 605145 605146 phrase
426581 clause
1125832 sentence
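The parent lines can be read as "these child nodes, then their parent": a comma-separated list of nodes and node ranges, followed by one target node. The sketch below assumes that reading and assumes whitespace between the child list and the parent; `expand` and `parse_parent` are hypothetical names. It then climbs from a word up to its sentence, which is the picture drawn above.

```python
def expand(spec):
    """Expand '0-1,858317' into [0, 1, 858317]."""
    nodes = []
    for part in spec.split(","):
        first, _, last = part.partition("-")
        nodes.extend(range(int(first), int(last or first) + 1))
    return nodes


def parse_parent(lines):
    """Map every child node to its parent node."""
    parent = {}
    for line in lines:
        children, target = line.split()
        for child in expand(children):
            parent[child] = int(target)
    return parent


parent = parse_parent([
    "0-1,858317 605143",
    "4-10,858320 605146",
    "514581,605143-605146 426581",
    "426581,1189402 1125832",
])

# Climb from word 4 to the top of the tree.
chain = [4]
while chain[-1] in parent:
    chain.append(parent[chain[-1]])
print(chain)
```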
flesh
text_full
בְּ
רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ
שָּׁמַ֖יִם
וְ
אֵ֥ת
הָ
אָֽרֶץ
וְ
הָ
אָ֗רֶץ
הָיְתָ֥ה
תֹ֨הוּ֙
וָ
בֹ֔הוּ
וְ
חֹ֖שֶׁךְ
trailer
_
_
_
_
_
_
׃_
_
_
_
_
_
p-s-p
prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
conj
art
subs
verb
subs
conj
subs
conj
subs
typ
PP
NP
VP
NP
-
Objc
Resu
skeleton
otype
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
phrase
phrase
phrase
phrase
clause
clause
clause
monads
600000 0-1
600001 2
600002 3
600003 4-10
400001 0-10
400002 11-21
400003 22-34
vertical:
feature
selection
horizontal:
object
selection
both:
modules
slicing 'n dicing
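With every feature stored as its own column, both kinds of selection become simple dictionary operations. The sketch below uses toy data loosely modelled on the slide; the node numbers, the `typ` value and the function names are assumptions for illustration, not the Text Fabric API.

```python
# Each feature is a column: node number -> value (toy data).
features = {
    "otype": {0: "word", 1: "word", 2: "word", 600000: "phrase"},
    "p-s-p": {0: "prep", 1: "subs", 2: "verb"},
    "typ": {600000: "PP"},
}


def select_features(data, names):
    """Vertical slice: keep only the named feature columns."""
    return {name: data[name] for name in names}


def select_objects(data, keep):
    """Horizontal slice: keep only the nodes for which keep(node) is true."""
    return {
        name: {node: value for node, value in column.items() if keep(node)}
        for name, column in data.items()
    }


def is_word(node):
    return features["otype"][node] == "word"


vertical = select_features(features, ["otype", "p-s-p"])
horizontal = select_objects(features, is_word)
# Both at once: a module with chosen features for chosen objects.
module = select_objects(select_features(features, ["p-s-p"]), is_word)
print(sorted(vertical), horizontal["typ"], module)
```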
collaboration
flesh
text_full
בְּ
רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ
שָּׁמַ֖יִם
וְ
אֵ֥ת
הָ
אָֽרֶץ
וְ
הָ
אָ֗רֶץ
הָיְתָ֥ה
תֹ֨הוּ֙
וָ
בֹ֔הוּ
וְ
חֹ֖שֶׁךְ
p-s-p
prep
subs
verb
subs
prep
art
subs
conj
prep
art
subs
conj
art
subs
verb
subs
conj
subs
conj
subs
skeleton
otype
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
phrase
phrase
phrase
phrase
clause
clause
clause
monads
600000 0-1
600001 2
600002 3
600003 4-10
400001 0-10
400002 11-21
400003 22-34
module:
feature strong
for words
module
strong
8675
7225
1254 a
430
853
8676
8064
8678
853
8676
776
8678
8676
776
1961
8414
8678
922
8678
2822
dependency:
on the implicit
monad order!
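The dependency can be shown in a few lines. A module such as `strong` is only a column of values, so it lines up with the words purely by position. In the sketch below, the first three lexemes and Strong's numbers come from the slides; the inserted word `NEW` and the function name `align` are hypothetical.

```python
old_words = ["B", "R>CJT/", "BR>["]
strong = ["8675", "7225", "1254 a"]  # one value per word, in word order

# Hypothetical later version of the text with one word inserted at position 1.
new_words = ["B", "NEW", "R>CJT/", "BR>["]


def align(words, values):
    """Pair words and module values purely by position."""
    return list(zip(words, values))


print(align(old_words, strong))  # every word gets its own number
print(align(new_words, strong))  # numbers after the insertion point are off by one
```

Nothing in the module itself records which word a value belongs to, so any change in the word order of the base data silently shifts the annotations.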
@20170101 @20161118
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
10 9
11 10
12 11
13 12
14 13
15 14
16 15
17 16
18 17
19 18
20 19
21 20
600001 600000
600002 600001
600003 600002
600004 600003
400002 400001
400003 400002
400004 400003
⇐
flesh
@20170101
text_full
בְּ
רֵאשִׁ֖ית
בָּרָ֣א
אֱלֹהִ֑ים
אֵ֥ת
הַ
שָּׁמַ֖יִם
וְ
אֵ֥ת
ש
הָ
אָֽרֶץ
וְ
הָ
אָ֗רֶץ
הָיְתָ֥ה
תֹ֨הוּ֙
וָ
בֹ֔הוּ
וְ
חֹ֖שֶׁךְ
otype
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
word
phrase
phrase
phrase
phrase
clause
clause
clause
skeleton
@20170101
monads
9
600001 0-1
600002 2
600003 3
600004 4-11
400002 0-11
400003 12-22
400004 23-35
strong
8675
7225
1254 a
430
853
8676
8064
8678
853
8676
776
8678
8676
776
1961
8414
8678
922
8678
2822
module
@20161118
versioning
old module
works as is on
new data
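One way to let an old module work on new data is to route every lookup through the node map between the two versions. The sketch below follows the two-column table (new node on the left, old node on the right) for the first dozen words; the function name `lookup` and the choice to return `None` for the inserted word are assumptions.

```python
# Old-version module: Strong's numbers keyed by old word node (values from the slide).
old_strong = dict(enumerate(
    ["8675", "7225", "1254 a", "430", "853", "8676", "8064", "8678", "853", "8676", "776"]
))

# Node map new -> old: a word was inserted at new position 9, so it has
# no old counterpart and all later nodes shift by one.
new_to_old = {new: new for new in range(9)}
new_to_old.update({new: new - 1 for new in range(10, 12)})


def lookup(feature, node_map, new_node, default=None):
    """Fetch an old-version feature value for a new-version node."""
    old_node = node_map.get(new_node)
    if old_node is None:
        return default  # node did not exist in the old version
    return feature.get(old_node, default)


for node in (8, 9, 10, 11):
    print(node, lookup(old_strong, new_to_old, node))
```

The module file itself stays untouched; only the map has to be produced when a new version of the skeleton is released.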
the end of the beginning
etcbc
Stephen Ku
David van Acker
James Cuénod
skeleton
@20160101 <=
@20170101
module strong
@20161118
flesh
@20160101
flesh
@20170101
module accent
@20170202
skeleton
@20170101 <=
@20180101
module strong
@20171118
flesh
@20180101
module assoc
@20180303
skeleton
@20180101 <=
@20190101
flesh
@20190101
skeleton
@20190101 <=
@20200101
dirk.roorda@dans.knaw.nl shebanq@ancient-data.org
