Ensuring accessible textbooks for children with disabilities is essential for inclusive education. However, providing native accessibility for educational content remains a challenge. In the meantime, existing educational materials need to be adapted, for example by providing interactive versions to overcome difficulties caused by disabilities. In this context, our project aims to automatically adapt PDF textbooks to make them accessible to children with disabilities. The first step towards this adaptation is to extract and structure the content of textbooks. In this paper, we introduce textbook models, propose an automated extraction pipeline, and conduct preliminary experiments. Our textbook models are based on the various activities involved and provide layout and semantic information. They enable normalized and structured representations of educational content at both document and page levels, facilitating the automatic extraction process and the conversion to popular formats such as TEI and DocBook. Our experiments on automatically extracting the structure of PDF textbooks, which use a state-of-the-art multimodal transformer for a token classification task, show promising results. However, they also highlight the difficulty of the task, especially generalization across textbook collections. Finally, we discuss the extraction pipeline and directions for future work.
Layout- and Activity-based Textbook Modeling for Automatic PDF Textbook Extraction
1. Layout- and Activity-based Textbook Modeling for Automatic PDF Textbook Extraction
Élise Lincker, Olivier Pons, Camille Guinaudeau, Isabelle Barbet, Jérôme Dupire, Céline Hudelot, Vincent Mousseau and Caroline Huron
guinaudeau@limsi.fr
2. MAnueLs INclusifs project
Paper textbooks remain prevalent in schools in France
Ensuring accessible textbooks for children with disabilities is essential for inclusive education
Textbook adaptation is done manually by NGOs
Developmental Coordination Disorder (DCD)
→ impairment in motor coordination
→ eye movement disorders
3. Automatic adaptation of textbooks
Limited work on extraction and automatic analysis of textbooks
• Transformation of PDF textbooks into intelligent educational resources[1]
transforms a PDF textbook into an interactive digital version
based on formal structure and hierarchy modeling
• Digitalization of Electronic Textbook Based on OPENCV[2]
identification and cutting of target areas followed by an OCR approach
specifically developed for textbooks
o Extraction step relies on rules
6. Textbook modeling
Textbook modeling is fundamental for textbook segmentation research
Markup schemes: HTML, DocBook and Text Encoding Initiative (TEI)
Proposed models adopt a generic structure (headings, sections, body text, paragraphs)[3]
MALIN project:
Description of all the types of instructional activities present in textbooks
Inspiration: conceptual guides for the elaboration of textbooks[4]
7. Textbook modeling
Observation of dozens of elementary school textbooks
Model inspired by:
→ existing models
→ guidelines for textbook creation
→ parsing of textbook collections to list and group nestings of all elements
8. Textbook modeling
The first model captures the encapsulation of elements
Additional information is represented using an “indicator” tag
<lesson type="vocabulary"…>
Smallest linguistic units: tokens <token sep="space">
Example: c…bat, 4 x … = 8, Manon has lost … cat
→ Grouped into text segments
List and table: 2 or more lists can be linked
Possible incorporation of additional semantic and morpho-syntactic attributes
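As a rough illustration of the first model, the sketch below builds a small nested structure with Python's standard `xml.etree.ElementTree`. The `lesson`, `indicator` and `token` names and the `type` and `sep` attributes appear on the slides; the `segment` element name, the `gap` attribute marking a blank to fill in, and the helper `build_segment` are assumptions made for illustration only, not the project's actual schema.

```python
import xml.etree.ElementTree as ET


def build_segment(parent, words):
    """Append a <segment> of <token> children to parent (hypothetical helper)."""
    segment = ET.SubElement(parent, "segment")  # element name is an assumption
    for word in words:
        token = ET.SubElement(segment, "token", sep="space")
        if word is None:
            # Hypothetical marker for a blank the pupil has to fill in.
            token.set("gap", "true")
        else:
            token.text = word
    return segment


# A lesson element carrying its type, one indicator, and one text segment.
lesson = ET.Element("lesson", type="vocabulary")
ET.SubElement(lesson, "indicator", type="revision")
build_segment(lesson, ["Manon", "has", "lost", None, "cat"])

xml_text = ET.tostring(lesson, encoding="unicode")
print(xml_text)
```

Keeping the blank as a token of its own, rather than as a character inside a word, is one possible way to let an interactive version replace it with an input field.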
9. Textbook modeling
Second model: content extraction task is performed at the page level
Each token and segment tag is assigned position and style attributes
<token sep="space" font="arial">
Chapter, theme or discipline titles constitute elements of their own, separate from activities
→ associated with potential indicators <indicator type="revision">
Consistency with previous research
Our format can be converted to DocBook and TEI
10. Automatic content extraction
Textbooks are parsed to an XML file with pdfalto and MuPDF
→ extraction of words along with their font style and spatial coordinates, as well as images
→ words are grouped into text segments based on rules on font sizes and styles, spacing between tokens, etc.
Web-based annotation interface to label segments according to their role
Manual annotation is then used to train a deep learning model for automatic textbook extraction
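The rule-based grouping step can be sketched as follows. This is a minimal illustration, not the project's actual rules: the `Word` record, the function `group_words` and both thresholds (`max_gap`, `max_line_shift`) are assumptions, and the input is taken to be words in reading order with the font and coordinates that a parser such as pdfalto would provide.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One extracted word with its style and bounding box (hypothetical record)."""
    text: str
    font: str
    size: float
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def group_words(words, max_gap=6.0, max_line_shift=2.0):
    """Group words, given in reading order, into text segments.

    A new segment starts when the font or size changes, when the word sits
    on another line, or when the horizontal gap to the previous word exceeds
    max_gap. Both thresholds are illustrative assumptions.
    """
    segments = []
    for word in words:
        if segments:
            prev = segments[-1][-1]
            same_style = word.font == prev.font and word.size == prev.size
            same_line = abs(word.ymin - prev.ymin) <= max_line_shift
            close = word.xmin - prev.xmax <= max_gap
            if same_style and same_line and close:
                segments[-1].append(word)
                continue
        segments.append([word])
    return segments


words = [
    Word("Exercise", "Arial-Bold", 14.0, 50, 100, 110, 114),
    Word("1", "Arial-Bold", 14.0, 114, 100, 122, 114),
    Word("Complete", "Arial", 11.0, 50, 120, 100, 131),
    Word("the", "Arial", 11.0, 103, 120, 120, 131),
    Word("words.", "Arial", 11.0, 123, 120, 160, 131),
]
segments = group_words(words)
print([" ".join(w.text for w in seg) for seg in segments])
```

In this toy page the bold heading and the instruction end up in separate segments because their fonts differ; real textbooks would likely need thresholds tuned per collection.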
11. Automatic content extraction
Manual annotation provides a dataset composed of 167 pages of one textbook
train: 120 pages
validation: 17 pages
test: 30 pages
+ 30 pages of a textbook of the same collection
+ 30 pages of a textbook of a different collection
Each token is associated with a coarse-grained page region label: discipline, chapter, heading, introductory activity, lesson, exercise, page number
12. Automatic content extraction
LayoutLM-inspired model for French educational content
LiLT combined with the French language model CamemBERT, fine-tuned on textbooks and reading materials
Pages longer than 512 tokens encoded with 2 overlapping segments
2 predictions are generated, one for each segment, and aligned for the entire page
If they differ:
→ re-encoded with additional context tokens on the left and right sides and passed through the model
→ the 3 predictions for the overlapping part are merged using a majority vote
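The overlap-and-vote procedure can be sketched as below. This is one possible reading of the slide, under stated assumptions: `predict` is a hypothetical callable standing in for the fine-tuned model, the third window is centred on the overlap, ties between three different labels fall back to the context-aware prediction, and the sketch only handles pages that fit in two segments.

```python
from collections import Counter

MAX_LEN = 512  # window size given on the slide


def majority(votes):
    """Most common label; if all votes differ, keep the last one (assumption)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > 1 else votes[-1]


def predict_page(tokens, predict, max_len=MAX_LEN):
    """Label every token of a page with a model limited to max_len tokens.

    `predict` is a hypothetical callable mapping a list of at most max_len
    tokens to a list of labels of the same length.
    """
    n = len(tokens)
    if n <= max_len:
        return predict(tokens)
    if n > 2 * max_len:
        raise ValueError("this sketch only handles pages covered by 2 segments")

    start2 = n - max_len                 # the second segment ends at the page end
    first = predict(tokens[:max_len])
    second = predict(tokens[start2:])
    labels = first[:start2] + second     # aligned predictions for the whole page

    # The overlap is the token range [start2, max_len).
    if first[start2:] != second[:max_len - start2]:
        # Re-encode with context on both sides of the overlap.
        centre = (start2 + max_len) // 2
        left = max(0, min(centre - max_len // 2, n - max_len))
        third = predict(tokens[left:left + max_len])
        for i in range(start2, max_len):
            votes = [first[i], second[i - start2], third[i - left]]
            labels[i] = majority(votes)
    return labels


def toy_predict(window):
    """Stand-in model that is always wrong on the first token of a window."""
    labels = ["exercise" if t.startswith("e") else "lesson" for t in window]
    labels[0] = "lesson"
    return labels


page = ["l0", "l1", "l2", "l3", "e4", "e5", "e6", "e7", "e8", "e9"]
labels = predict_page(page, toy_predict, max_len=6)
print(labels)
```

In the toy run the second segment mislabels its first token, the two segments disagree on the overlap, and the vote with the third, centred window restores the expected label.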
14. Automatic content extraction
Good accuracy but:
• coarse-grained labels
• activities that necessitate a shift in the mode of interaction require more in-depth extraction
15. Conclusion and future work
Contributions:
• Definition of layout- and activity-based textbook models
• Automatic extraction of textbook content (coarse-grained)
Limitations and future work:
• variation across textbook collections
• only French language study textbooks
• fine-grained extraction
Data augmentation and generation
17. References
[1] I. Alpizar-Chacon et al. Transformation of PDF textbooks into intelligent educational resources. In Proceedings of the 2nd International Workshop on Intelligent Textbooks, 21st International Conference on Artificial Intelligence in Education, 2020.
[2] Z.-M. Deng et al. Digitalization of Electronic Textbook Based on OPENCV. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2020.
[3] L.-L. Stahn et al. Using TEI for textbook research. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), 2016.
[4] F.-M. Gérard et al. Des manuels scolaires pour apprendre: concevoir, évaluer, utiliser. De Boeck Supérieur, 2009.
18. Textbook modeling
A unique identifier (@id) and positional attributes (@xmin, @ymin, @xmax, @ymax) are assigned to all elements.
Attributes in italics correspond to layout and style characteristics added for the page extraction mode
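To make the identifier and positional attributes concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. The attribute names `id`, `xmin`, `ymin`, `xmax`, `ymax`, `sep` and `font` come from the slides; the `segment` element name, the identifier format, the helpers `bounding_box` and `make_element`, and the idea that a parent's box is the union of its children's boxes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def bounding_box(boxes):
    """Smallest (xmin, ymin, xmax, ymax) box enclosing all given boxes."""
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)
    return min(xmins), min(ymins), max(xmaxs), max(ymaxs)


def make_element(tag, element_id, box, **attrs):
    """Create an element carrying an id and positional attributes (hypothetical helper)."""
    xmin, ymin, xmax, ymax = box
    return ET.Element(
        tag,
        id=element_id,
        xmin=str(xmin), ymin=str(ymin), xmax=str(xmax), ymax=str(ymax),
        **attrs,
    )


token_boxes = [(50, 120, 100, 131), (103, 120, 120, 131)]

# Assumption: the segment box is the union of the boxes of its tokens.
segment = make_element("segment", "seg-1", bounding_box(token_boxes))
for i, box in enumerate(token_boxes, start=1):
    segment.append(make_element("token", f"tok-{i}", box, sep="space", font="arial"))

print(ET.tostring(segment, encoding="unicode"))
```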