Layout- and Activity-based Textbook
Modeling for Automatic PDF Textbook
Extraction
Élise Lincker, Olivier Pons, Camille Guinaudeau, Isabelle Barbet, Jérôme
Dupire, Céline Hudelot, Vincent Mousseau and Caroline Huron
guinaudeau@limsi.fr
MAnueLs INclusifs project
Paper textbooks remain prevalent in
schools in France
Ensuring accessible textbooks for
children with disabilities is essential
for inclusive education
Textbook adaptation is done manually
by NGOs
Developmental Coordination Disorder
(DCD)
➢ impairment in motor coordination
➢ eye movement disorders
Automatic adaptation of textbooks
Limited work on extraction and automatic analysis of textbooks
• Transformation of PDF textbooks into intelligent educational resources[1]
transforms PDF textbook into interactive digital version
based on formal structure and hierarchy modeling
• Digitalization of Electronic Textbook Based on OPENCV[2]
identification and cutting of target areas followed by OCR approach
specifically developed for textbooks
o Extraction step relies on rules
Approach
01 Definition of layout- and activity-based textbook models
02 Automatic extraction of textbook content
03 Classification of exercises according to their adaptation to DCD
04 Automatic adaptation of exercises
CBMI
Textbook modeling
Textbook modeling is fundamental for textbook segmentation research
Markup schemes: HTML, DocBook and Text Encoding Initiative (TEI)
Proposed models adopt a generic structure (headings, sections, body text,
paragraphs)[3]
MALIN project:
Description of all the types of instructional activities present in textbooks
Inspiration: conceptual guides for the elaboration of textbooks[4]
Textbook modeling
Observation of dozens of elementary school textbooks
Model inspired by:
➢ existing models
➢ guidelines for textbook creation
➢ parsing of textbook collections to list and group nestings of all elements
Textbook modeling
The first model captures the encapsulation of elements
Additional information is represented using an “indicator” tag
<lesson type="vocabulary"…>
Smallest linguistic units: tokens <token sep="space">
Example: c…bat, 4 x … = 8, Manon has lost … cat
→ Grouped into text segments
List and table: 2 or more lists can be linked
Possible incorporation of additional semantic and morpho-syntactic attributes
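As a minimal sketch of how this encapsulation could look in practice, the Python snippet below builds a small fragment with the standard `xml.etree.ElementTree` module. The `lesson` and `token` tags and the `type` and `sep` attributes come from the slide; the `segment` tag spelling, the `sep="none"` value and the helper names `build_segment` and `segment_text` are assumptions made for illustration, not the project's actual schema.

```python
import xml.etree.ElementTree as ET


def build_segment(parent, pieces):
    """Append a <segment> holding one <token> per (text, sep) pair."""
    segment = ET.SubElement(parent, "segment")
    for text, sep in pieces:
        token = ET.SubElement(segment, "token", sep=sep)
        token.text = text
    return segment


def segment_text(segment):
    """Rebuild the surface string; only sep="space" adds a blank."""
    parts = []
    for token in segment.iter("token"):
        parts.append(token.text)
        if token.get("sep") == "space":
            parts.append(" ")
    return "".join(parts).rstrip()


lesson = ET.Element("lesson", type="vocabulary")
# "c…bat": the gap sits inside a word, so no space follows "c" or "…".
# The value "none" is a hypothetical counterpart to the slide's "space".
fill_in = build_segment(lesson, [("c", "none"), ("…", "none"), ("bat", "space")])

print(ET.tostring(lesson, encoding="unicode"))
print(segment_text(fill_in))
```

Keeping the separator on each token is what lets a gap inside a word ("c…bat") be told apart from a gap between words ("lost … cat") when the text is rebuilt.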
Textbook modeling
Second model: content extraction task is performed at the page level
Each token and segment tag is assigned position and style attributes
<token sep="space" font="arial">
Chapter, theme or discipline titles constitute an element on their own, separate
from activities
→ associated with potential indicators <indicator type="revision">
Consistency with previous research
Our format can be converted to DocBook and TEI
Automatic content extraction
Textbooks are parsed to an XML file with pdfalto and MuPDF
→ extraction of words along with their font style and spatial coordinates, as well as
images
→ words are grouped into text segments based on rules on font sizes and styles,
spacing between tokens, etc.
Web-based annotation interface to label segments according to their role
Manual annotation is then used to train a deep learning model for automatic
textbook extraction
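The rule-based grouping of words into text segments can be illustrated with a short Python sketch. It does not call pdfalto or MuPDF; it assumes the words have already been extracted in reading order. The `Word` fields, the `max_gap` and `line_tol` thresholds and the sample values are all hypothetical, and the three rules (same font and size, same line, small horizontal gap) are only one plausible reading of "rules on font sizes and styles, spacing between tokens".

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    font: str
    size: float
    xmin: float
    xmax: float
    ymin: float


def group_words(words, max_gap=6.0, line_tol=2.0):
    """Group words (given in reading order) into segments with simple layout rules."""
    segments = []
    for word in words:
        if segments:
            prev = segments[-1][-1]
            same_style = (prev.font, prev.size) == (word.font, word.size)
            same_line = abs(prev.ymin - word.ymin) <= line_tol
            close = 0 <= word.xmin - prev.xmax <= max_gap
            if same_style and same_line and close:
                segments[-1].append(word)
                continue
        # Any rule failing starts a new segment.
        segments.append([word])
    return segments


words = [
    Word("Exercice", "Arial-Bold", 14.0, 50, 110, 100),
    Word("1", "Arial-Bold", 14.0, 114, 120, 100),
    Word("Complète", "Arial", 11.0, 50, 100, 120),
    Word("les", "Arial", 11.0, 103, 118, 120),
    Word("mots.", "Arial", 11.0, 121, 150, 120),
]
segments = group_words(words)
for segment in segments:
    print(" ".join(w.text for w in segment))
```

With these sample values the bold heading and the instruction line end up in two separate segments, because the font and size change between them.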
Automatic content extraction
Manual annotation provides a dataset composed of 167 pages of one textbook
train: 120 pages
validation: 17 pages
test: 30 pages
+ 30 pages of a textbook of the same collection
+ 30 pages of a textbook of a different collection
Each token is associated with a coarse-grained page region label: discipline, chapter,
heading, introductory activity, lesson, exercise, page number
Automatic content extraction
LayoutLM-inspired model for educational French
LiLT combined with the French language model CamemBERT, fine-tuned on
textbooks and reading materials
Pages longer than 512 tokens encoded with 2 overlapping segments
2 predictions are generated for each segment and aligned for the entire page
if different:
➢ re-encoded with additional context tokens on the left and right sides and passed
through the model
➢ the 3 predictions for the overlapping part are merged using a majority vote
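The overlap-and-vote step can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the function names, the window placement (first and last 512 tokens of the page) and the tie-breaking rule are assumptions, and the model calls are replaced by hand-written label lists. The third prediction is passed for the whole overlap for simplicity; it only changes the outcome where the first two disagree, which matches re-encoding only in that case.

```python
from collections import Counter

MAX_LEN = 512  # window size stated on the slide


def two_windows(n_tokens, max_len=MAX_LEN):
    """Return (start, end) spans of two overlapping windows covering a page."""
    if not max_len < n_tokens <= 2 * max_len:
        raise ValueError("sketch only handles pages needing exactly two windows")
    return (0, max_len), (n_tokens - max_len, n_tokens)


def merge_predictions(first, second, third, n_tokens, max_len=MAX_LEN):
    """Merge per-token labels; `third` holds labels for the overlap only."""
    (_, end_a), (start_b, _) = two_windows(n_tokens, max_len)
    merged = list(first[:start_b])           # covered by the first window only
    for i in range(start_b, end_a):          # overlapping part
        votes = [first[i], second[i - start_b], third[i - start_b]]
        # Majority vote; if all three differ, the first window's label wins
        # (an arbitrary choice made for this sketch).
        merged.append(Counter(votes).most_common(1)[0][0])
    merged.extend(second[end_a - start_b:])  # covered by the second window only
    return merged


# Toy page of 6 tokens with a window of 4, so tokens 2 and 3 overlap.
first = ["heading", "heading", "exercise", "exercise"]
second = ["heading", "exercise", "exercise", "page number"]
third = ["exercise", "exercise"]
merged = merge_predictions(first, second, third, n_tokens=6, max_len=4)
print(merged)
```

In the toy example the two windows disagree on token 2 ("exercise" versus "heading"), and the third prediction settles it in favour of "exercise".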
Automatic content extraction
Good accuracy but:
• coarse-grained labels
• activities that necessitate a shift in the mode of interaction require more in-depth
extraction
Conclusion and future work
Contributions:
• Definition of layout- and activity-based textbook models
• Automatic extraction of textbook content (coarse-grained)
Limitations and future work:
• variation across textbook collections
• only French language study textbooks
• fine-grained extraction
Data augmentation and
generation
Questions?
References
[1] I. Alpizar-Chacon et al. Transformation of PDF textbooks into intelligent educational resources. In
Proceedings of the 2nd International Workshop on Intelligent Textbooks, 21st International Conference on
Artificial Intelligence in Education, 2020.
[2] Z.-M. Deng et al. Digitalization of Electronic Textbook Based on OPENCV. In Proceedings of the International
Conference on Machine Learning and Cybernetics, 2020.
[3] L.-L. Stahn et al. Using TEI for textbook research. In Proceedings
of the Workshop on Language Technology Resources and Tools for Digital Humanities
(LT4DH), 2016.
[4] F.-M. Gérard et al. Des manuels scolaires pour apprendre: concevoir, évaluer,
utiliser. De Boeck Supérieur, 2009.
Textbook modeling
A unique identifier (@id) and positional
attributes (@xmin, @ymin, @xmax, @ymax) are
assigned to all elements.
Attributes in italics correspond to layout and
style characteristics added for the page
extraction mode

Welcome to GFDL for Take Your Child To Work DayWelcome to GFDL for Take Your Child To Work Day
Welcome to GFDL for Take Your Child To Work Day
 
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort ServiceHot Sexy call girls in  Moti Nagar,🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Moti Nagar,🔝 9953056974 🔝 escort Service
 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of Traits
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 
Temporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of MasticationTemporomandibular joint Muscles of Mastication
Temporomandibular joint Muscles of Mastication
 
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptxAnalytical Profile of Coleus Forskohlii | Forskolin .pptx
Analytical Profile of Coleus Forskohlii | Forskolin .pptx
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
Best Call Girls In Sector 29 Gurgaon❤️8860477959 EscorTs Service In 24/7 Delh...
 
Volatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -IVolatile Oils Pharmacognosy And Phytochemistry -I
Volatile Oils Pharmacognosy And Phytochemistry -I
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)Recombinant DNA technology( Transgenic plant and animal)
Recombinant DNA technology( Transgenic plant and animal)
 

Layout- and Activity-based Textbook Modeling for Automatic PDF Textbook Extraction

  • 1. Layout- and Activity-based Textbook Modeling for Automatic PDF Textbook Extraction. Élise Lincker, Olivier Pons, Camille Guinaudeau, Isabelle Barbet, Jérôme Dupire, Céline Hudelot, Vincent Mousseau and Caroline Huron. guinaudeau@limsi.fr
  • 2. MAnueLs INclusifs project: Paper textbooks remain prevalent in schools in France. Ensuring accessible textbooks for children with disabilities is essential for inclusive education. Textbook adaptation is done manually by NGOs. Developmental Coordination Disorder (DCD): impairment in motor coordination; eye movement disorders.
  • 3. Automatic adaptation of textbooks: Limited work on extraction and automatic analysis of textbooks. • Transformation of PDF textbooks into intelligent educational resources[1]: transforms PDF textbook into interactive digital version, based on formal structure and hierarchy modeling. • Digitalization of Electronic Textbook Based on OPENCV[2]: identification and cutting of target areas followed by OCR approach specifically developed for textbooks. In both, the extraction step relies on rules.
  • 4. Approach: 01 Definition of layout- and activity-based textbook models; 02 Automatic extraction of textbook content; 03 Classification of exercises according to their adaptation to DCD; 04 Automatic adaptation of exercises.
  • 5. Approach: 01 Definition of layout- and activity-based textbook models; 02 Automatic extraction of textbook content; 03 Classification of exercises according to their adaptation to DCD; 04 Automatic adaptation of exercises. CBMI
  • 6. Textbook modeling: Textbook modeling is fundamental for textbook segmentation research. Markup schemes: HTML, DocBook and Text Encoding Initiative (TEI). Proposed models adopt a generic structure (headings, sections, body text, paragraphs)[3]. MALIN project: description of all the types of instructional activities present in textbooks. Inspiration: conceptual guides for the elaboration of textbooks[4].
  • 7. Textbook modeling: Observation of dozens of elementary school textbooks. Model inspired by: existing models; guidelines for textbook creation; parsing of textbook collections to list and group nestings of all elements.
  • 8. Textbook modeling: The first model captures the encapsulation of elements. Additional information is represented using an “indicator” tag: <lesson type="vocabulary"…>. Smallest linguistic units: tokens, <token sep="space">. Example: c…bat, 4 x … = 8, Manon has lost … cat → Grouped into text segments. List and table: 2 or more lists can be linked. Possible incorporation of additional semantic and morpho-syntactic attributes.
  • 9. Textbook modeling: Second model: the content extraction task is performed at the page level. Each token and segment tag is assigned position and style attributes: <token sep="space" font="arial">. Chapter, theme or discipline titles constitute an element on their own, separate from activities → associated with potential indicators: <indicator type="revision">. Consistency with previous research. Our format can be converted to DocBook and TEI.
  • 10. Automatic content extraction: Textbooks are parsed to an XML file with pdfalto and MuPDF → extraction of words along with their font style and spatial coordinates, as well as images → words are grouped into text segments based on rules on font sizes and styles, spacing between tokens, etc. Web-based annotation interface to label segments according to their role. Manual annotation is then used to train a deep learning model for automatic textbook extraction.
  • 11. Automatic content extraction: Manual annotation provides a dataset composed of 167 pages of one textbook (train: 120 pages; validation: 17 pages; test: 30 pages) + 30 pages of a textbook of the same collection + 30 pages of a textbook of a different collection. Each token is associated with a coarse-grained page region label: discipline, chapter, heading, introductory activity, lesson, exercise, page number.
  • 12. Automatic content extraction: LayoutLM-inspired model for educational French: LiLT combined with the French language model CamemBERT, fine-tuned on textbooks and reading materials. Pages longer than 512 tokens are encoded with 2 overlapping segments. 2 predictions are generated, one for each segment, and aligned for the entire page. If they differ on the overlapping part: the overlap is re-encoded with additional context tokens on the left and right sides and passed through the model again; the 3 predictions for the overlapping part are merged using a majority vote.
  • 14. Automatic content extraction: Good accuracy but: • coarse-grained labels • activities that necessitate a shift in the mode of interaction require more in-depth extraction.
  • 15. Conclusion and future work: Contributions: • Definition of layout- and activity-based textbook models • Automatic extraction of textbook content (coarse-grained). Limitations and future work: • variation in textbooks collection • only French language study textbooks • fine-grained extraction. Data augmentation and generation.
  • 17. References: [1] I. Alpizar-Chacon et al. Transformation of PDF textbooks into intelligent educational resources. In Proceedings of the 2nd International Workshop on Intelligent Textbooks, 21st International Conference on Artificial Intelligence in Education, 2020. [2] Z.-M. Deng et al. Digitalization of Electronic Textbook Based on OPENCV. In Proceedings of the International Conference on Machine Learning and Cybernetics, 2020. [3] L.-L. Stahn et al. Using TEI for textbook research. In Proceedings of the Workshop on Language Technology Resources and Tools for Digital Humanities (LT4DH), 2016. [4] F.-M. Gérard et al. Des manuels scolaires pour apprendre: concevoir, évaluer, utiliser. De Boeck Supérieur, 2009.
  • 18. Textbook modeling: A unique identifier (@id) and positional attributes (@xmin, @ymin, @xmax, @ymax) are assigned to all elements. Attributes in italics correspond to layout and style characteristics added for the page extraction mode.
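
The long-page handling described on slide 12 (two overlapping windows of at most 512 tokens, a third pass over the overlap when the two windows disagree, then a majority vote) might look roughly like the sketch below. This is not the authors' implementation: `label_page` and the `predict` callable are hypothetical names standing in for the fine-tuned model, the amount of extra context given to the third pass is an assumption, and so is the tie-breaking rule.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical wrapper around the token classifier: tokens in, one label per token out.
Predictor = Callable[[Sequence[str]], List[str]]


def label_page(tokens: Sequence[str], predict: Predictor, max_len: int = 512) -> List[str]:
    """Label every token of a page using at most two overlapping windows."""
    n = len(tokens)
    if n <= max_len:
        return list(predict(tokens))
    if n > 2 * max_len:
        raise ValueError("this sketch only handles pages that fit in two windows")

    lo, hi = n - max_len, max_len          # the two windows overlap on [lo, hi)
    first = list(predict(tokens[:hi]))     # window 1 covers [0, hi)
    second = list(predict(tokens[lo:]))    # window 2 covers [lo, n)
    over_first, over_second = first[lo:], second[:hi - lo]

    if over_first == over_second:
        merged = over_first
    else:
        # Third pass (assumed layout): the overlap plus as much context
        # on each side as still fits in one window.
        ctx = (max_len - (hi - lo)) // 2
        start, end = max(0, lo - ctx), min(n, hi + ctx)
        third = list(predict(tokens[start:end]))
        over_third = third[lo - start:hi - start]
        # Majority vote per token; on a three-way tie the first window's label wins.
        merged = [Counter(votes).most_common(1)[0][0]
                  for votes in zip(over_first, over_second, over_third)]

    return first[:lo] + merged + second[hi - lo:]
```

Tokens outside the overlap keep the label from the only window that saw them, so the vote only ever changes labels in the region both windows cover.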