Stylometry of literary papyri
Holger Essler, Jeremi K. Ochab
Institute of Physics
Jagiellonian University
DATeCH 2019
10th May 2019 Brussels
Questions & Aims
How can we correct, improve, or disambiguate
uncertain or unspecific metadata?
› Author
› Genre
› Place
› Date
› Keywords
› …
› Can we extract them from text?
Data
Metadata
Processing
Data
https://github.com/DCLP/idp.data/tree/dclp/DCLP
Data
14624
metadata
279 transcriptions (paraliterary)
298 transcriptions
748 transcriptions
Data: metadata
14624
metadata
298 transcriptions
748 transcriptions
• Greek
• known author
• >50 words
Data: metadata
14624
metadata
298 transcriptions
www.trismegistos.org/place/2722
www.trismegistos.org/authorwork/3062
Data: metadata
14624
metadata
298 transcriptions
Philodemus
Single-text authors
• Author
• Place (written): region, nome
• Place (found): region, nome
• Date: not before, not after
• Keywords, e.g.: prose, gospel, literature, Christian
• Material, e.g.: papyrus, pottery
Data: cleaning
14624
298 transcriptions
http://papyri.info/docs/leiden_plus
Data: cleaning
298 transcriptions
Manually tagged for:
• line break, paragraph, gap, hand-shift markers
• spelling corrections
• unclear reading
• deletions
• expanded abbreviations
• orthographic regularisation
• diacritical marks
• certainty or reason for above
Two strategies:
› diversifying: retain <orig> and <hi>, omit <reg> and <ex>
› normalising: omit <orig> and <hi>, retain <reg> and <ex>
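The two cleaning strategies (diversifying and normalising) can be sketched as a filter over the XML tree. This is a minimal illustration, not the authors' pipeline: the helper `extract_text`, the two omit-sets, and the toy `<ab>` fragment are assumptions made for the example, and real EpiDoc files carry a TEI namespace and many more tags than the four named on the slide.

```python
import xml.etree.ElementTree as ET

# Tags whose content is dropped under each strategy (as listed on the slide).
DIVERSIFYING_OMIT = {"reg", "ex"}   # keeps <orig> and <hi>
NORMALISING_OMIT = {"orig", "hi"}   # keeps <reg> and <ex>


def local_name(tag):
    """Strip a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]


def extract_text(elem, omit):
    """Concatenate the text under elem, skipping elements named in omit."""
    parts = [elem.text or ""]
    for child in elem:
        if local_name(child.tag) not in omit:
            parts.append(extract_text(child, omit))
        # Text following a child belongs to the parent and is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


# Hypothetical toy fragment with Latin placeholders instead of Greek.
SAMPLE = (
    "<ab>a <choice><reg>REG</reg><orig>ORIG</orig></choice>"
    " b<ex>EX</ex> <hi>H</hi></ab>"
)
root = ET.fromstring(SAMPLE)
diversified = extract_text(root, DIVERSIFYING_OMIT)
normalised = extract_text(root, NORMALISING_OMIT)
print(repr(diversified))
print(repr(normalised))
```

Keeping the omit-sets as data makes it easy to try intermediate strategies, for example dropping only `<ex>`.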
Methods
Distance-based clustering
Community detection in networks
Clustering quality measures
Distance-based clustering
Compute text similarity » word frequencies
› Burrows’s delta » Manhattan (taxicab) distance of z-scored frequencies
› Eder’s, Argamon’s, …
› cosine delta
Hierarchically cluster (unsupervised)
› single, complete, …
› Ward linkage
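As a rough sketch of the similarity step, the two delta variants named above can be computed from a texts-by-words table of relative frequencies. The function names and the toy table are illustrative assumptions, and so is the use of the population standard deviation (implementations differ on this point). Burrows's delta is taken here as the mean absolute difference of z-scores, that is, the Manhattan distance divided by the number of words; cosine delta is one minus the cosine similarity of the z-score vectors.

```python
from math import sqrt
from statistics import mean, pstdev


def z_scores(freq_table):
    """Standardise each word column of a texts-by-words frequency table."""
    columns = list(zip(*freq_table))
    mus = [mean(col) for col in columns]
    # Assumption: population standard deviation; a constant column gets 1.0
    # so that the division below is always defined.
    sigmas = [pstdev(col) or 1.0 for col in columns]
    return [
        [(f - mu) / sigma for f, mu, sigma in zip(row, mus, sigmas)]
        for row in freq_table
    ]


def burrows_delta(za, zb):
    """Mean absolute difference between two z-score vectors."""
    return mean(abs(a - b) for a, b in zip(za, zb))


def cosine_delta(za, zb):
    """One minus the cosine similarity of two non-zero z-score vectors."""
    dot = sum(a * b for a, b in zip(za, zb))
    norm_a = sqrt(sum(a * a for a in za))
    norm_b = sqrt(sum(b * b for b in zb))
    return 1.0 - dot / (norm_a * norm_b)


# Toy table: two texts, two words (made-up numbers).
freqs = [[1.0, 4.0], [3.0, 2.0]]
z = z_scores(freqs)
print(burrows_delta(z[0], z[1]), cosine_delta(z[0], z[1]))
```

With only two texts the z-scores are exactly opposite, so both distances reach their maximum for this toy input; real tables use many texts and the most frequent words.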
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
J.F. Burrows. 2002. "Delta": a measure of stylistic difference and a guide to likely authorship. Literary and Linguistic Computing 17, 3 (2002), 267–287.
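For the hierarchical step, a naive agglomerative procedure with Ward linkage can be written with the Lance-Williams update on squared distances. This is a sketch under stated assumptions, not the implementation used for the talk: `ward_linkage` is a hypothetical helper, the input is a full symmetric distance matrix, and Ward's criterion is strictly defined only for Euclidean distances, so applying it to delta distances is a heuristic.

```python
def ward_linkage(dist):
    """Return merges as (cluster_a, cluster_b, height, new_size) tuples."""
    n = len(dist)
    clusters = {i: [i] for i in range(n)}
    # Squared distances between active clusters, keyed by (smaller, larger) id.
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair; ties go to the earliest-inserted pair.
        (a, b), best = min(d2.items(), key=lambda item: item[1])
        na, nb = len(clusters[a]), len(clusters[b])
        updated = {}
        for k in clusters:
            if k in (a, b):
                continue
            nk = len(clusters[k])
            dak = d2[tuple(sorted((a, k)))]
            dbk = d2[tuple(sorted((b, k)))]
            # Lance-Williams update for Ward linkage (squared distances).
            updated[(k, next_id)] = (
                (na + nk) * dak + (nb + nk) * dbk - nk * best
            ) / (na + nb + nk)
        # Drop every pair involving the two merged clusters, add the new ones.
        d2 = {p: v for p, v in d2.items() if a not in p and b not in p}
        d2.update(updated)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, best ** 0.5, len(clusters[next_id])))
        next_id += 1
    return merges


# Toy example: four points on a line at 0, 1, 10 and 11.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
merges = ward_linkage(dist)
for merge in merges:
    print(merge)
```

The merge list has the same shape as a dendrogram description: each row names the two clusters joined, the height of the join, and the size of the result. The quadratic search for the closest pair is fine for a few hundred texts but not for large corpora.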
Community detection in networks
› Louvain (modularity)
› Infomap
› OSLOM
› …
M.E.J. Newman. 2003. The structure and function of complex networks. SIAM Review 45 (2003), 167–256.
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital Scholarship in the Humanities 32, 1 (2017), 50–64.
J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
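Louvain-type methods search for the partition that maximises modularity. The sketch below does not implement that search; it only evaluates the modularity Q of a given partition on a small undirected weighted graph, which is the quantity being optimised. The function name, the edge-list format, and the toy graph are assumptions for illustration, and self-loops are assumed absent.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity of a partition.

    edges: iterable of (u, v, weight) for an undirected graph, no self-loops.
    community: dict mapping every node to a community label.
    """
    two_m = 2.0 * sum(w for _, _, w in edges)
    strength = defaultdict(float)   # weighted degree of each node
    internal = defaultdict(float)   # edge weight inside each community
    for u, v, w in edges:
        strength[u] += w
        strength[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    total = defaultdict(float)      # summed strength per community
    for node, s in strength.items():
        total[community[node]] += s
    # Q = sum over communities of (internal share - expected share).
    return sum(
        2.0 * internal[c] / two_m - (total[c] / two_m) ** 2 for c in total
    )


# Toy graph: two triangles joined by a single edge.
edges = [
    (0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
    (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
    (2, 3, 1.0),
]
two_groups = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_group = {node: "a" for node in range(6)}
q_split = modularity(edges, two_groups)
q_single = modularity(edges, one_group)
print(q_split, q_single)
```

In a stylometric network the nodes would be texts and the edge weights some function of their similarity; how the edges are chosen is a separate modelling decision not covered here.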
Clustering quality measures
Many different indices:
› Jaccard, Dunn, silhouette, Davies-Bouldin, …
› Rand index » adjusted
› mutual information » normalised » adjusted » standardised
› some selection bias remains (number and size of clusters)
Nguyen Xuan Vinh, Julien Epps, and James Bailey. 2009. Information theoretic measures for clusterings comparison: is a correction for chance necessary? In Proceedings of the 26th International Conference on Machine Learning. PMLR. 1073–1080.
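To illustrate why a correction for chance matters, the following sketch compares the plain Rand index with the adjusted Rand index on a toy pair of labelings. The function names are illustrative, and returning 1.0 when the adjustment is undefined (for example, both labelings put everything in one cluster) is a convention assumed here. The adjusted mutual information reported in the results follows the same idea but needs the expected mutual information under random labelings, which is omitted.

```python
from collections import Counter
from math import comb


def pair_counts(labels_a, labels_b):
    """Pair sums from the contingency table of two labelings."""
    joint = Counter(zip(labels_a, labels_b))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    return sum_joint, sum_a, sum_b, comb(len(labels_a), 2)


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which the two labelings agree (n >= 2)."""
    sum_joint, sum_a, sum_b, total = pair_counts(labels_a, labels_b)
    return 1.0 + (2 * sum_joint - sum_a - sum_b) / total


def adjusted_rand_index(labels_a, labels_b):
    """Rand index corrected for the agreement expected by chance."""
    sum_joint, sum_a, sum_b, total = pair_counts(labels_a, labels_b)
    expected = sum_a * sum_b / total
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # assumed convention for degenerate labelings
    return (sum_joint - expected) / (maximum - expected)


# Toy labelings: 'truth' could be authors, 'found' a clustering result.
truth = [0, 0, 1, 1]
found = [0, 1, 0, 1]
ri = rand_index(truth, found)
ari = adjusted_rand_index(truth, found)
print(ri, ari)
```

On this toy input the raw Rand index is still positive although the clustering never pairs the right items, while the adjusted index drops below zero.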
Results
It is hard!
Results
› Best network clustering
» modularity optimisation: AMI (adjusted mutual information) = 0.22 (very low)
» number of clusters: 7
› Which similarity measure
» Burrows’s delta: AMI < 0.1 (terrible)
» cosine delta: AMI = 0.25 (very low; ~0.6 in novels)
» number of clusters: 15–25 (close)
Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes Amaral. 2004. Modularity from fluctuations in random graphs and complex networks. Physical Review E 70, 2 (2004), 025101.
Results
Maciej Eder. 2017. Visualization in stylometry: Cluster analysis using networks. Digital
Scholarship in the Humanities 32, 1 (2017), 50–64.
Conclusions
› Results
o clustering depends on text regularisation
o trade-off between sparseness and distinctiveness of features(?)
› Problems
o imbalanced data
o texts too small
› Outlook
o N-grams + SVD to circumvent sparseness
o augment texts preserved by medieval transmission
o supervised ML to narrow down: genre/text type, dates, places, …
o Documentary papyri
J Rybicki
Institute of English Studies
Jagiellonian University
Grants: 2017/26/E/HS2/01019
M Eder
J Byszuk
H Essler
S Pielström
References:
› J.K. Ochab, J. Byszuk, S. Pielström, M. Eder, "Identifying Similarities in Text Analysis: Hierarchical Clustering (Linkage) versus Network Clustering (Community Detection)," DH Conference Abstracts 2019.
› computationalstylistics.github.io
› https://github.com/computationalstylistics/stylometry_of_papyri
Thank you!
Questions?
