Isaac Alpizar-Chacon, Max van der Hart, Zef S. Wiersma, Lorenzo S.J. Theunissen, and Sergey Sosnovsky
Transformation of PDF Textbooks into Interactive Educational Resources
7-7-2020
Motivation
Digital textbooks are a standard medium to distribute educational content
Most digital textbooks are digital copies of their printed counterparts
Creation of intelligent textbooks requires effort and expertise
The INTelligent TEXTBOOKS system
Complete transformation of PDF textbooks into online intelligent educational resources
The Intextbooks system
Extracts a semantic model from a PDF textbook
Converts the PDF textbook into an HTML/CSS representation
Enriches the HTML with a fine-grained DOM (Document Object Model) connected to the semantic model
Every content/layout/structure element of the textbook is identifiable in the DOM
Any object of a textbook can become an object of targeted interaction
Every element of a textbook’s DOM is also identifiable within the semantic model extracted from the textbook
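A minimal sketch of the two-way link between DOM elements and the semantic model, assuming the link is carried by shared identifiers. The HTML fragment, the ids, and the dictionary layout of the model are hypothetical and only illustrate the idea; they are not the actual Intextbooks data structures.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the id attribute of every element, in document order."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids.append(value)


# Hypothetical page fragment: every section and word carries an id.
HTML_PAGE = '<div id="sec1"><span id="w1">The</span> <span id="w2">mean</span></div>'

# Hypothetical semantic model, keyed by the same ids.
MODEL = {
    "sec1": {"kind": "section", "title": "Descriptive statistics"},
    "w1": {"kind": "word", "text": "The"},
    "w2": {"kind": "word", "text": "mean", "term": "mean"},
}


def link_dom_to_model(html, model):
    """Return (linked, unlinked): model entries per DOM id, and ids with no entry."""
    collector = IdCollector()
    collector.feed(html)
    linked = {i: model[i] for i in collector.ids if i in model}
    unlinked = [i for i in collector.ids if i not in model]
    return linked, unlinked


linked, unlinked = link_dom_to_model(HTML_PAGE, MODEL)
```

With such an index, a click on the element with id `w2` can be resolved to the domain term it belongs to, and a term in the model can be resolved back to the elements that display it.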
Intelligent Textbooks
Each textbook becomes an integrated resource where content elements and pieces of domain knowledge are interlinked on both the presentation and the knowledge levels
Architecture: Ingestion (offline)
Architecture: Web-reader (online)
Textbook model extractor
Rule-based approach
7 steps, 17 tasks, 55 rules
TEI Textbook Model
Structure (sections)
Content (words, lines, titles, etc.)
Domain Knowledge (terms)
+ RDFa attributes
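A small sketch of how the three layers of such a model (structure, content, domain knowledge) could be read back from a TEI file. The fragment below is a hypothetical illustration: the element choice (`div`, `head`, `w`, `term`), the ids, and the RDFa attribute values are assumptions, not the actual Intextbooks output.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical TEI fragment with RDFa attributes (about, property) on a term.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div type="section" xml:id="sec1">
      <head>Descriptive statistics</head>
      <p>
        <w xml:id="w1">The</w>
        <term xml:id="t1" about="#mean" property="skos:prefLabel"><w xml:id="w2">mean</w></term>
        <w xml:id="w3">summarizes</w>
        <w xml:id="w4">data</w>
      </p>
    </div>
  </body></text>
</TEI>
""".strip()

root = ET.fromstring(SAMPLE)

# Structure: sections with their titles.
sections = [
    (div.get(XML_ID), div.find("tei:head", TEI_NS).text)
    for div in root.iterfind(".//tei:div[@type='section']", TEI_NS)
]

# Content: individual words in document order.
words = [w.text for w in root.iterfind(".//tei:w", TEI_NS)]

# Domain knowledge: terms with the resource named by their RDFa "about" attribute.
terms = [
    (t.get("about"), "".join(t.itertext()).strip())
    for t in root.iterfind(".//tei:term", TEI_NS)
]
```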
PDF to HTML converter
• Several open libraries available:
• pdf2htmlEX, PDFMiner, pdf2html, Xpdf, etc.
• pdf2htmlEX:
• preserves the layout perfectly across very different types of documents
• produces the same structure across different documents
• fast, stable, and scalable
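A hedged sketch of how the conversion step could be scripted around the pdf2htmlEX command-line tool. The `--zoom` and `--dest-dir` options are used as they are commonly documented for pdf2htmlEX and should be checked against `pdf2htmlEX --help` for the installed version. The function names, the zoom value, and the file names are assumptions for illustration, not the settings used by Intextbooks.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, out_dir, zoom=1.3):
    """Build the pdf2htmlEX argument list (zoom value is an arbitrary assumption)."""
    pdf = Path(pdf_path)
    return [
        "pdf2htmlEX",
        "--zoom", str(zoom),
        "--dest-dir", str(out_dir),
        str(pdf),
        pdf.stem + ".html",
    ]


def convert(pdf_path, out_dir):
    """Run the conversion and return the path of the generated HTML file."""
    if shutil.which("pdf2htmlEX") is None:
        raise RuntimeError("pdf2htmlEX is not installed or not on PATH")
    subprocess.run(build_command(pdf_path, out_dir), check=True)
    return Path(out_dir) / (Path(pdf_path).stem + ".html")


command = build_command("book.pdf", "out")
```

Calling `convert("book.pdf", "out")` would then write `out/book.html`, provided the tool is installed.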
TEI-HTML synchronizer
Other components (planned)
Student model: keeps an internal representation of the learning progress of the students
Monitoring engine: logs every action of the students
Adaptation engine: uses the student model and the activity log to provide personalization to the students
Validation
Test the accuracy of the matching algorithm for the TEI-HTML synchronization
70 university-level textbooks
Domains: statistics, computer science, web programming, literature, history
3 versions of the algorithm: use of a threshold to sort and merge close lines (subscripts and superscripts)
Evaluation metric: percentage of words that were matched between the TEI and HTML representations
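To make the role of the threshold and the evaluation metric concrete, here is a toy sketch. It is not the authors' matching algorithm: the `Word` class, the coordinates, and the use of `difflib.SequenceMatcher` for alignment are assumptions. HTML words whose vertical positions differ by at most the threshold are merged into one line, the lines are read left to right, and the metric is the percentage of TEI words that are aligned with an HTML word.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Word:
    text: str
    x: float  # horizontal position on the page
    y: float  # vertical position on the page


def reading_order(words, threshold):
    """Merge words whose y is within `threshold` of a line's first word, then read left to right."""
    lines = []
    for word in sorted(words, key=lambda w: w.y):
        if lines and abs(word.y - lines[-1][0].y) <= threshold:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [w.text for line in lines for w in sorted(line, key=lambda w: w.x)]


def match_rate(tei_words, html_words):
    """Percentage of TEI words aligned with an HTML word (sequence alignment)."""
    matcher = SequenceMatcher(None, tei_words, html_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 100.0 * matched / len(tei_words)


# Toy data: "2" is a superscript, so it sits slightly above its line.
TEI_WORDS = ["x", "2", "grows", "fast"]
HTML_WORDS = [
    Word("x", 10, 100),
    Word("2", 16, 96),
    Word("grows", 30, 100),
    Word("fast", 10, 120),
]

no_threshold = match_rate(TEI_WORDS, reading_order(HTML_WORDS, threshold=0))
fixed_threshold = match_rate(TEI_WORDS, reading_order(HTML_WORDS, threshold=5))
```

In this toy case the superscript is read as a separate line when no threshold is used, so only 3 of 4 words align (75%); with a threshold of 5 it is merged into its line and all words align (100%). A dynamic threshold would derive the value from the page itself, for example from the font size.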
Results
No threshold: 87.16%
Fixed threshold: 88.76%
Dynamic threshold: 87.09%
Analysis
Results are very similar across variants of the algorithm
Textbooks consisting mostly of text get a very high matching rate (~100%)
Textbooks with figures, tables, and graphs get a lower matching rate
Words with subscripts and superscripts are not matched correctly
Summary
• We have presented the Intextbooks system, which:
• extracts high-quality semantic models of the textbooks
• creates HTML representations of the same textbooks that are connected to their semantic models using fine-grained DOM structures
• matches around 88% of all the words in the textbooks to individual elements in the HTML resource
• We have designed a web interface to interact with the textbooks
• We have plans to incorporate a student model and an adaptation mechanism
Future work
• Extend and improve the components of the system:
• improve the matching algorithm
• add the missing components
• Better define the semantics of knowledge that is extracted from the textbooks,
and potential applications
• Test the textbook modeling technology with different textbook formatting styles
across different domains (e.g., medicine)
• Evaluate the system’s effectiveness in a user study with real students from a
target group

Editor's Notes

  • #11 Goal: extract and enrich a semantic model of a textbook
  • #14 Goal: create an HTML that keeps the visual layout of the PDF textbooks
  • #15 Goal: create the fine-grained DOM structure of the HTML that is matched to the elements in the textbook model
  • #16 Goal: create the fine-grained DOM structure of the HTML that is matched to the elements in the textbook model
  • #17 Goal: allow students to engage and interact with the textbook in multiple ways
  • #23 Less formal domains (e.g., medicine) or domains with conflicting viewpoints (e.g., history)