Slides of the paper Towards a generic unsupervised method for transcription of encoded manuscripts by Arnau Baró, Jialuo Chen, Alicia Fornés and Beáta Megyesi at the 3rd Edition of the DATeCH2019 International Conference

- 1. Towards a Generic Unsupervised Method for Transcription of Encoded Manuscripts Arnau Baró, Jialuo Chen, Alicia Fornés, Beáta Megyesi
- 2. Outline ● Introduction ● Methodology ○ Preprocessing ○ Clustering ○ Transcription ● Results ○ Datasets ● Conclusions and Future Work 2
- 3. Historical Chiper 3 ● A cipher is an encrypted message to keep its content hidden from others than the intended receiver. ● Contains hidden messages. ● It is estimated that the 1% of the documents in archives contain cipher text. ● Diﬀerent ciphers: ○ Diplomatic correspondence. ○ Scientiﬁc writings. ○ Manuscripts related to secret societies.
- 4. Historical Cipher 4 (example: a cipher page shown together with its clear text)
- 5. Introduction DECODE: Automatic Decoding of Historical Manuscripts. Goal: to develop resources and tools for the automatic decryption of enciphered documents from early modern times, through cross-disciplinary research spanning computational linguistics, computer science, image processing, history, linguistics and philology. 5
- 6. Decryption of encoded documents 6
- 8. Motivation Difficulties in the transcription: ● Symbol alphabets differ from cipher to cipher. ● Ciphers are written in different handwriting styles. ● It is difficult to obtain labeled data for training. Proposal: an unsupervised, generic transcription method. 8
- 9. Outline ● Introduction ● Methodology ○ Preprocessing ○ Clustering ○ Transcription ● Results ○ Datasets ● Conclusions and Future Work 9
- 10. Pipeline 10: Input → Preprocessing → Clustering → Transcription (outputs: Symbol 1, Symbol 2, Symbol 3)
- 11. Preprocessing 11 (example pages from the Numerical, Borg and Copiale ciphers)
- 12. Preprocessing - Binarization 12 (binarized examples from the Borg, Copiale and Numerical ciphers)
- 13. Preprocessing - Segmentation 13
- 14. Preprocessing - Segmentation 14 The segmentation of symbols without knowing the symbol alphabet is difficult.
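The slides do not spell out the preprocessing implementation, so the following is only a rough sketch of one common approach (Otsu binarization followed by connected-component analysis with OpenCV) that illustrates the step and why touching symbols are problematic; the function name and the min_area parameter are illustrative assumptions, not the authors' code.

```python
# Rough sketch: Otsu binarization + connected-component symbol segmentation (assumed
# approach; the paper's exact preprocessing may differ).
import cv2

def segment_symbols(image_path, min_area=30):
    """Binarize a cipher page and return one cropped image per candidate symbol."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Otsu thresholding: ink becomes white (255) on a black background.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # Connected components give a rough segmentation; touching symbols fall into a
    # single component, which is exactly the difficult case mentioned on the slide.
    n_labels, _, stats, _ = cv2.connectedComponentsWithStats(binary)
    symbols = []
    for i in range(1, n_labels):           # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:               # drop specks and noise
            symbols.append(binary[y:y + h, x:x + w])
    return symbols
```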
- 15. Unsupervised clustering 15 ● Hierarchical k-means algorithm ○ SIFT as a descriptor. ○ The process finishes when the clusters contain few symbols. ○ Used to obtain the initial seeds for the Label Propagation algorithm. The process can be sped up by limiting the maximum number of clusters and/or the number of symbols per cluster.
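A minimal sketch of this clustering step, assuming a fixed-length descriptor obtained by averaging a symbol's SIFT keypoint descriptors and using scikit-learn's KMeans for the recursive splits; the descriptor choice and the k / max_cluster_size parameters are illustrative, not the authors' exact settings.

```python
# Minimal sketch: hierarchical k-means over SIFT-based symbol descriptors.
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()

def symbol_descriptor(symbol_img):
    """Fixed-length descriptor: average of the symbol's SIFT keypoint descriptors."""
    img = cv2.resize(symbol_img, (64, 64))
    _, des = sift.detectAndCompute(img, None)
    if des is None:                         # no keypoints found: fall back to zeros
        return np.zeros(128, dtype=np.float32)
    return des.mean(axis=0)

def hierarchical_kmeans(descriptors, indices=None, k=2, max_cluster_size=20):
    """Recursively split clusters until each holds few symbols; return index groups."""
    if indices is None:
        indices = np.arange(len(descriptors))
    if len(indices) <= max_cluster_size:
        return [indices]                    # small enough: use as a seed cluster
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(descriptors[indices])
    clusters = []
    for c in range(k):
        sub = indices[labels == c]
        if len(sub) == 0 or len(sub) == len(indices):
            return [indices]                # degenerate split: stop recursing
        clusters.extend(hierarchical_kmeans(descriptors, sub, k, max_cluster_size))
    return clusters

# descriptors = np.stack([symbol_descriptor(s) for s in symbols])
# seed_clusters = hierarchical_kmeans(descriptors)
```

The returned index groups play the role of the seed clusters passed to the label-propagation step on the next slide.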
- 16. Label Propagation 16 ● Consists of assigning labels to unlabeled elements. ○ The process starts from the clusters generated by the k-means. ● K-Nearest Neighbours is used as the kernel of the algorithm. ○ Neighbours are compared using a SIFT descriptor. ● Propagation step ○ Soft assignment
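The slides describe a KNN-kernel label propagation with soft assignment. As a stand-in for the paper's own implementation, here is a minimal sketch using scikit-learn's LabelPropagation, seeded with the k-means clusters; in practice one might seed only the most central elements of each cluster and leave the remaining symbols unlabeled, and the parameters below are assumptions.

```python
# Minimal sketch: label propagation with a KNN kernel, seeded by the k-means clusters.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def propagate_labels(descriptors, seed_clusters, n_neighbors=7):
    """descriptors: (N, D) array; seed_clusters: list of index arrays, one per cluster."""
    y = -np.ones(len(descriptors), dtype=int)     # -1 marks unlabeled symbols
    for cluster_id, idx in enumerate(seed_clusters):
        y[idx] = cluster_id                       # seed symbols carry their cluster id
    model = LabelPropagation(kernel='knn', n_neighbors=n_neighbors, max_iter=1000)
    model.fit(descriptors, y)
    # transduction_ holds the hard labels; label_distributions_ holds the soft
    # assignment (one probability per cluster) used later for thresholding.
    return model.transduction_, model.label_distributions_
```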
- 19. Transcription ● Each symbol has a vector of probabilities (except the label propagation seeds). ● For each symbol, we assign its most probable label if that probability is higher than a confidence threshold; otherwise, the symbol is labelled as unknown (*). ● This decision is based on the fact that users prefer unlabelled symbols over wrongly labelled ones. 19
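A minimal sketch of this confidence-thresholding rule applied to the soft assignments from the propagation step; the cluster_to_glyph mapping (cluster id → transcription symbol) is a hypothetical helper, not part of the original slides.

```python
# Minimal sketch of the thresholding rule: keep a symbol's label only when its best
# class probability exceeds the confidence threshold, otherwise mark it as unknown.
import numpy as np

def transcribe(label_distributions, cluster_to_glyph, threshold=0.6):
    """Assign each symbol its best label if confident enough, otherwise '*' (unknown)."""
    transcription = []
    for probs in label_distributions:
        best = int(np.argmax(probs))
        if probs[best] > threshold:
            transcription.append(cluster_to_glyph[best])
        else:
            transcription.append('*')     # users prefer a gap over a wrong label
    return transcription
```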
- 20. Outline ● Introduction ● Methodology ○ Preprocessing ○ Clustering ○ Transcription ● Results ○ Datasets ● Conclusions and Future Work 20
- 21. Datasets 21 (example pages from the Numerical, Borg and Copiale ciphers) Metrics ● Symbol Error Rate: percentage of errors in the transcription. ● Missing Symbols: percentage of symbols that remain untranscribed (the confidence is lower than the threshold). ● Symbol Coverage:
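A minimal sketch of how the first two metrics could be computed, assuming the Symbol Error Rate is a normalized edit (Levenshtein) distance between the predicted and ground-truth symbol sequences and that missing symbols are those transcribed as '*'; the authors' exact evaluation scripts may differ.

```python
# Minimal sketch of the reported metrics under the assumptions stated above.
def edit_distance(pred, ref):
    """Dynamic-programming Levenshtein distance over symbol sequences."""
    d = [[0] * (len(ref) + 1) for _ in range(len(pred) + 1)]
    for i in range(len(pred) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(pred) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(pred)][len(ref)]

def symbol_error_rate(pred, ref):
    return edit_distance(pred, ref) / max(len(ref), 1)

def missing_symbol_rate(pred):
    return pred.count('*') / max(len(pred), 1)
```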
- 22. Results - Copiale Dataset 22 ● Large alphabet. ● Challenge: 99 different symbols. ● 103 pages.
- 24. Results - Borg Dataset 24 ● Challenge: ○ Lots of touching symbols. ● 34 different symbols. ● 30 pages.
- 26. Results - Numerical Dataset 26 ● Small alphabet (10 symbols) with many ornaments and clear text. ● 29 pages. ● Challenge: 5 different handwriting styles.
- 28. Qualitative Results 28 (transcription examples from the Borg and Copiale ciphers)
- 29. User Effort Unsupervised Method - Training free (Our Method). Two scenarios: ● No user intervention. ● User intervention (1-2 hours): ○ The user must select one cluster for each glyph, given the cipher's symbol set. ○ The user should also remove outliers from these clusters. ○ We asked a non-expert user to perform these tasks, since no expertise is required for tasks based on shape similarity. Training-based Method (MDLSTM): Multi-Dimensional Recurrent Neural Network (LSTM). ● An expert user must transcribe a given amount of pages, concretely 26%, 34%, 42% or 50% of the full cipher (e.g. for Copiale, this means transcribing 25, 34, 42 or 51 pages, respectively). 29
- 34. Results - Comparison with a supervised method 34

| Method | User Intervention | Conf. Threshold | Borg SER | Borg Missing | Copiale SER | Copiale Missing | Numerical SER | Numerical Missing |
|--------|-------------------|-----------------|----------|--------------|-------------|-----------------|---------------|-------------------|
| Ours   | None              | 0.4 | 0.542 | 0.377 | 0.444 | 0.031 | 0.275 | 0.053 |
| Ours   | Select Clusters   | 0.4 | 0.522 | 0.173 | 0.201 | 0.010 | 0.189 | 0.013 |
| MDLSTM | Label 26% of data | 0.4 | 0.715 | 0.154 | 0.131 | 0.008 | -     | -     |
| MDLSTM | Label 34% of data | 0.4 | 0.662 | 0.069 | 0.120 | 0.009 | -     | -     |
| MDLSTM | Label 42% of data | 0.4 | 0.693 | 0.061 | 0.084 | 0.006 | -     | -     |
| MDLSTM | Label 50% of data | 0.4 | 0.556 | 0.035 | 0.075 | 0.003 | -     | -     |
| Ours   | None              | 0.6 | 0.445 | 0.547 | 0.365 | 0.073 | 0.225 | 0.137 |
| Ours   | Select Clusters   | 0.6 | 0.464 | 0.282 | 0.173 | 0.024 | 0.167 | 0.033 |
| MDLSTM | Label 26% of data | 0.6 | 0.554 | 0.399 | 0.113 | 0.068 | -     | -     |
| MDLSTM | Label 34% of data | 0.6 | 0.523 | 0.280 | 0.109 | 0.076 | -     | -     |
| MDLSTM | Label 42% of data | 0.6 | 0.551 | 0.269 | 0.078 | 0.048 | -     | -     |
| MDLSTM | Label 50% of data | 0.6 | 0.450 | 0.212 | 0.074 | 0.038 | -     | -     |
| Ours   | None              | 0.8 | 0.377 | 0.713 | 0.264 | 0.138 | 0.197 | 0.231 |
| Ours   | Select Clusters   | 0.8 | 0.418 | 0.385 | 0.144 | 0.045 | 0.146 | 0.056 |
| MDLSTM | Label 26% of data | 0.8 | 0.546 | 0.513 | 0.110 | 0.154 | -     | -     |
| MDLSTM | Label 34% of data | 0.8 | 0.437 | 0.435 | 0.111 | 0.170 | -     | -     |
| MDLSTM | Label 42% of data | 0.8 | 0.465 | 0.421 | 0.087 | 0.116 | -     | -     |
| MDLSTM | Label 50% of data | 0.8 | 0.365 | 0.365 | 0.081 | 0.098 | -     | -     |
| MDLSTM | Label 50% of data | -   | -     | -     | -     | -     | 0.12  | -     |

(SER = Symbol Error Rate; Missing = percentage of missing symbols; '-' = not reported.)
- 36. Discussion ● Our Method - comparing the two scenarios: ○ With very little user intervention, the performance significantly improves and the percentage of missing symbols significantly decreases. ● MDLSTM: ○ Not surprisingly, the more training data (i.e. the more user effort required), the better the results, as indicated by lower SER values and a lower percentage of missing symbols. 36
- 37. Outline ● Introduction ● Methodology ○ Preprocessing ○ Clustering ○ Transcription ● Results ○ Datasets ● Conclusions and Future Work 37
- 38. Conclusions 38 ● We proposed an unsupervised method to transcribe ciphers with various kinds of symbol sets. ● It can be applied automatically to the whole cipher, avoiding asking the user to transcribe any pages for training. ● The obtained results are promising, considering that the user effort is zero. ● With minimal user intervention, the accuracy is comparable to that of a training-based method.
- 39. Future Work 39 ● We will focus on more powerful segmentation techniques to better deal with touching elements. ● We will explore zero-shot learning and domain adaptation techniques to improve the classification of ciphers with new types of symbol sets. ● Combinations of supervised and unsupervised methods, as well as interactive transcription methods, are also promising research directions. ● We are developing a web-based platform. ○ https://cl.lingfil.uu.se/decode/transcription/
- 40. Thank You! Questions? 40
- 43. Results - Comparison with a supervised method (Numerical cipher) 43

| Method | User Intervention | Conf. Threshold | Numerical SER | Numerical Missing |
|--------|-------------------|-----------------|---------------|-------------------|
| Ours   | None              | 0.4 | 0.275 | 0.053 |
| Ours   | Select Clusters   | 0.4 | 0.189 | 0.013 |
| Ours   | None              | 0.6 | 0.225 | 0.137 |
| Ours   | Select Clusters   | 0.6 | 0.167 | 0.033 |
| Ours   | None              | 0.8 | 0.197 | 0.231 |
| Ours   | Select Clusters   | 0.8 | 0.146 | 0.056 |
| MDLSTM | Label 50% of data | -   | 0.12  | -     |