Layout- and Activity-based Textbook Modeling for Automatic PDF Textbook Extraction

Élise Lincker, Olivier Pons, Camille Guinaudeau, Isabelle Barbet, Jérôme Dupire, Céline Hudelot, Vincent Mousseau and Caroline Huron
guinaudeau@limsi.fr

MAnueLs INclusifs project
Paper textbooks remain prevalent in schools in France
Ensuring accessible textbooks for children with disabilities is essential for inclusive education
Textbook adaptation is done manually by NGOs
Developmental Coordination Disorder (DCD)
• impairment in motor coordination
• eye movement disorders

Automatic adaptation of textbooks
Limited work on extraction and automatic analysis of textbooks
• Transformation of PDF textbooks into intelligent educational resources[1]
transforms PDF textbook into interactive digital version
based on formal structure and hierarchy modeling
• Digitalization of Electronic Textbook Based on OPENCV[2]
identification and cutting of target areas followed by OCR approach
specifically developed for textbooks
o Extraction step relies on rules

Approach
01 Definition of layout- and activity-based textbook models
02 Automatic extraction of textbook content
03 Classification of exercises according to their adaptation to DCD
04 Automatic adaptation of exercises

Approach
CBMI

Textbook modeling
Textbook modeling is fundamental for textbook segmentation research
Markup schemes: HTML, DocBook and Text Encoding Initiative (TEI)
Proposed models adopt a generic structure (headings, sections, body text, paragraphs)[3]
MALIN project:
Description of all the types of instructional activities present in textbooks
Inspiration: conceptual guides for the elaboration of textbooks[4]

Textbook modeling
Observation of dozens of elementary school textbooks
Model inspired by:
• existing models
• guidelines for textbook creation
• parsing of textbook collections to list and group nestings of all elements

Textbook modeling
The first model captures the encapsulation of elements
Additional information is represented using an “indicator” tag
<lesson type="vocabulary"…>
Smallest linguistic units: tokens <token sep="space">
Example: c…bat, 4 x … = 8, Manon has lost … cat
→ Grouped into text segments
List and table: 2 or more lists can be linked
Possible incorporation of additional semantic and morpho-syntactic attributes
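
As a rough illustration of the encapsulation described above, the following Python sketch builds a small lesson element with nested tokens. Only the `lesson` and `token` tags and the `type` and `sep` attributes appear on the slides; the `segment` tag name, the nesting order, and the representation of a gap as a token containing "…" are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def build_lesson(words, lesson_type="vocabulary"):
    """Build a <lesson> element holding one text segment of tokens.

    The <segment> tag name and the nesting are hypothetical; the slides
    only show <lesson type="..."> and <token sep="space">.
    """
    lesson = ET.Element("lesson", {"type": lesson_type})
    segment = ET.SubElement(lesson, "segment")
    for word in words:
        # Each word (or gap marker) becomes one token separated by a space.
        token = ET.SubElement(segment, "token", {"sep": "space"})
        token.text = word
    return lesson


# Gap-fill example taken from the slide: "Manon has lost … cat"
lesson = build_lesson(["Manon", "has", "lost", "…", "cat"])
xml_text = ET.tostring(lesson, encoding="unicode")
print(xml_text)
```

Keeping the gap as an ordinary token is only one possible choice; the model may represent gaps differently.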

Textbook modeling
Second model: the content extraction task is performed at the page level
Each token and segment tag is assigned position and style attributes
<token sep="space" font="arial">
Chapter, theme or discipline titles constitute elements of their own, separate from activities
→ associated with potential indicators <indicator type="revision">
Consistency with previous research
Our format can be converted to DocBook and TEI

Automatic content extraction
Textbooks are parsed to an XML file with pdfalto and MuPDF
→ extraction of words along with their font style and spatial coordinates, as well as images
→ words are grouped into text segments based on rules on font sizes and styles, spacing between tokens, etc.
Web-based annotation interface to label segments according to their role
Manual annotation is then used to train a deep learning model for automatic textbook extraction
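
The rule-based grouping of words into text segments could look roughly like the sketch below. It is not the project's actual rule set: the `Word` structure, the thresholds `max_gap` and `max_line_shift`, and the three conditions (same font and size, same line, small horizontal gap) are assumptions chosen to illustrate the idea, and the input is assumed to be in reading order.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One extracted word with its box and style (hypothetical structure)."""
    text: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float
    font: str
    size: float


def group_words(words, max_gap=8.0, max_line_shift=2.0):
    """Group consecutive words into segments using simple layout rules.

    A word joins the current segment when it shares font and size with the
    previous word, sits on the same line, and follows it closely.
    """
    segments = []
    for word in words:
        if segments:
            prev = segments[-1][-1]
            same_style = (prev.font, prev.size) == (word.font, word.size)
            same_line = abs(prev.ymin - word.ymin) <= max_line_shift
            close = 0 <= word.xmin - prev.xmax <= max_gap
            if same_style and same_line and close:
                segments[-1].append(word)
                continue
        # Otherwise the word starts a new segment.
        segments.append([word])
    return segments


words = [
    Word("Read", 0, 0, 20, 10, "arial", 12),
    Word("the", 24, 0, 40, 10, "arial", 12),
    Word("text", 44, 0, 60, 10, "arial", 12),
    Word("Exercise", 0, 30, 50, 44, "arial-bold", 14),
]
segments = group_words(words)
print([[w.text for w in seg] for seg in segments])
```

Real textbook pages would need further rules, for example for multi-line paragraphs and multi-column layouts.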

Manual annotation provides a dataset composed of 167 pages of one textbook
train: 120 pages
validation: 17 pages
test: 30 pages
+ 30 pages of a textbook of the same collection
+ 30 pages of a textbook of a different collection
Each token is associated with a coarse-grained page region label: discipline, chapter, heading, introductory activity, lesson, exercise, page number

LayoutLM-inspired model for educational French
LiLT combined with the French language model CamemBERT, fine-tuned on textbooks and reading materials
Pages longer than 512 tokens are encoded as 2 overlapping segments
2 predictions are generated, one for each segment, and aligned for the entire page
if they differ:
• the overlapping part is re-encoded with additional context tokens on the left and right sides and passed through the model
• the 3 predictions for the overlapping part are merged using a majority vote
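
The overlap handling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window placement (first and last 512 tokens), the `repredict` callback standing in for the re-encoding pass, and the tie-breaking behaviour are all assumptions.

```python
from collections import Counter

MAX_LEN = 512  # window size stated on the slide


def two_windows(n_tokens, max_len=MAX_LEN):
    """Return (start, end) spans covering a page with at most two windows."""
    if n_tokens <= max_len:
        return [(0, n_tokens)]
    if n_tokens > 2 * max_len:
        raise ValueError("page too long for two windows")
    # Assumption: the first and last max_len tokens, overlapping in the middle.
    return [(0, max_len), (n_tokens - max_len, n_tokens)]


def majority_vote(*predictions):
    """Merge equally long label sequences position by position.

    If all labels at a position differ, the first prediction wins.
    """
    return [Counter(labels).most_common(1)[0][0] for labels in zip(*predictions)]


def merge_page(n_tokens, pred_a, pred_b, repredict, max_len=MAX_LEN):
    """Align two window predictions into one label sequence for the page.

    repredict(start, end) is a hypothetical callback returning a third
    prediction for the overlapping tokens after re-encoding with context.
    """
    spans = two_windows(n_tokens, max_len)
    if len(spans) == 1:
        return list(pred_a)
    (_, end_a), (start_b, _) = spans
    width = end_a - start_b
    overlap_a = list(pred_a[start_b:end_a])
    overlap_b = list(pred_b[:width])
    if overlap_a == overlap_b:
        overlap = overlap_a
    else:
        overlap = majority_vote(overlap_a, overlap_b, repredict(start_b, end_a))
    return list(pred_a[:start_b]) + overlap + list(pred_b[width:])


# Toy page of 6 tokens with a window of 4, so tokens 2..3 overlap.
pred_a = ["exercise", "exercise", "lesson", "lesson"]
pred_b = ["exercise", "lesson", "lesson", "lesson"]
merged = merge_page(6, pred_a, pred_b, lambda s, e: ["lesson", "lesson"], max_len=4)
print(merged)
```

The third prediction is only requested when the two windows disagree on the overlap, so pages with consistent predictions need no extra model pass.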


Good accuracy but:
• coarse-grained labels
• activities that necessitate a shift in the mode of interaction require more in-depth
extraction

Conclusion and future work
Contributions:
• Definition of layout- and activity-based textbook models
• Automatic extraction of textbook content (coarse-grained)
Limitations and future work:
• variation across textbook collections
• only French language study textbooks
• fine-grained extraction
Data augmentation and generation

References
[1] I. Alpizar-Chacon et al. Transformation of PDF textbooks into intelligent educational resources. In Proceedings of the 2nd International Workshop on Intelligent Textbooks, 21st International Conference on Artificial Intelligence in Education, 2020.
[2] Z.-M. Deng et al. Digitalization of Electronic Textbook Based on OPENCV. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2020.
[3] L.-L. Stahn et al. Using TEI for textbook research. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), 2016.
[4] F.-M. Gérard et al. Des manuels scolaires pour apprendre: concevoir, évaluer, utiliser. De Boeck Supérieur, 2009.

Textbook modeling
A unique identifier (@id) and positional
attributes (@xmin, @ymin, @xmax, @ymax) are
assigned to all elements.
Attributes in italics correspond to layout and
style characteristics added for the page
extraction mode
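
One way to fill the positional attributes of a container element is to take the union of the boxes of the tokens it holds. The sketch below assumes this convention, which the slides do not state; the helper names and the `segment` tag are hypothetical, while `id`, `xmin`, `ymin`, `xmax` and `ymax` follow the attribute names given above.

```python
import xml.etree.ElementTree as ET


def union_box(boxes):
    """Return the smallest (xmin, ymin, xmax, ymax) box enclosing all boxes."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("at least one box is required")
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def segment_element(seg_id, token_boxes):
    """Create a <segment> carrying an id and the union box of its tokens."""
    xmin, ymin, xmax, ymax = union_box(token_boxes)
    return ET.Element("segment", {
        "id": seg_id,
        "xmin": str(xmin), "ymin": str(ymin),
        "xmax": str(xmax), "ymax": str(ymax),
    })


box = union_box([(10, 20, 30, 40), (25, 18, 60, 38)])
element = segment_element("s1", [(10, 20, 30, 40), (25, 18, 60, 38)])
print(box, element.attrib)
```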
