As textbooks evolve into digital platforms, they open a world of opportunities for Artificial Intelligence in Education (AIED) research. This paper delves into the novel use of textbooks as a source of high-quality labeled data for automatic keyword extraction, demonstrating an affordable and efficient alternative to traditional methods. By utilizing the wealth of structured information provided in textbooks, we propose a methodology for annotating corpora across diverse domains, circumventing the costly and time-consuming process of manual data annotation. Our research presents a deep learning model based on Bidirectional Encoder Representations from Transformers (BERT) fine-tuned on this newly labeled dataset. This model is applied to keyword extraction tasks, with the model’s performance surpassing established baselines. We further analyze the transformation of BERT’s embedding space before and after the fine-tuning phase, illuminating how the model adapts to specific domain goals. Our findings substantiate textbooks as a resource-rich, untapped well of high-quality labeled data, underpinning their significant role in the AIED research landscape.
Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic Keyword Extraction
1. Harnessing Textbooks for High-Quality Labelled Data: An Approach to Automatic Keyword Extraction
Lorenzo Pozzi, Isaac Alpizar-Chacon, Sergey Sosnovsky
2. Main idea
• Automated keyword extraction approaches:
  • Supervised
  • Unsupervised
  • Deep-learning-based
• Modern deep learning methods based on the transformer architecture are quite effective
• A Large Language Model is pre-trained on a global corpus of general documents
• Then it is fine-tuned for domain- and context-specific tasks with a smaller corpus of labelled data (a minimal fine-tuning sketch follows below)
• Where to get these corpora of labelled data for model re-training?
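A minimal sketch of this pre-train/fine-tune step, assuming a token-level keyword-tagging task with the Hugging Face transformers library; the sentence, label, and hyperparameters are toy placeholders, not the paper's setup:

```python
# Minimal fine-tuning sketch (assumed setup): adapt a pre-trained BERT to
# token-level keyword tagging (1 = Key, 0 = NoKey) on a toy example.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

enc = tokenizer("The arithmetic mean measures central tendency",
                return_tensors="pt")
labels = torch.zeros_like(enc["input_ids"])
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
labels[0, tokens.index("mean")] = 1  # tag "mean" as a keyword

model.train()
loss = model(**enc, labels=labels).loss  # cross-entropy over Key/NoKey
loss.backward()
optimizer.step()
```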
3. Standard Dataset Annotation is not Feasible…
For three reasons:
• Annotation takes a long time
• Domain experts are well-paid
• Large amounts of data are needed
4. Can we use Textbooks Instead?
Textbooks are:
• Domain-oriented: focus on narrow and cohesive domains
• High-quality: written by experts
• Purposeful: written to explain the domain in detail and cover its important parts
Provided with labelled data, textbooks should be usable to fine-tune LLMs at scale
6. Textbooks as a source of (extractable) knowledge
• Focus (narrow, cohesive domain)
• Quality (created by domain experts)
• Purpose (content explains domain knowledge to a novice)
• Structure: sections / subsections
• Order: easy to complex
• Formatting: …of content and headers
• Additional structural elements: indices, tables of content
• Topics/subtopics: underlying content, textual labels
• Pedagogical relations: prerequisites <-> outcomes
• Text types/roles and relations: header vs. important vs. regular text; same format = same role
• Meaningful labels: glossary of curated meaningful terms, set of important domain categories
• If automatically extracted and formally represented, these elements will form the model of the textbook and the model of the domain as the author understands it (one possible representation is sketched below)
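One way such a formal representation could look, sketched as Python dataclasses; the field names are illustrative assumptions, not the authors' actual schema:

```python
# Hypothetical data model for an extracted textbook (field names are
# illustrative, not the authors' schema).
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Section:
    title: str                      # textual label of the topic/subtopic
    text: str                       # underlying content
    role: str = "regular"           # header vs. important vs. regular
    subsections: list["Section"] = field(default_factory=list)

@dataclass
class IndexTerm:
    label: str                      # curated, meaningful domain term
    pages: list[int] = field(default_factory=list)
    dbpedia_uri: Optional[str] = None  # filled in during linking (slide 10)

@dataclass
class TextbookModel:
    title: str
    sections: list[Section] = field(default_factory=list)
    index: list[IndexTerm] = field(default_factory=list)
    # pedagogical relations: prerequisite section titles per section
    prerequisites: dict[str, list[str]] = field(default_factory=dict)
```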
7. Potential Problems of These Models
• Variability: structure, labels, order, focus, coverage
• Subjectivity: same domain + different authors = different textbooks => different models
• Quality: completeness, granularity, consistency
• Lack of semantics: more structure than knowledge, lack of links, cohesiveness of topics and index terms
8. Model Extraction from PDF Textbooks
Pipeline: Extraction → Linking & Enrichment → Integration → Domain validation → Concept validation → Formalisation
(Isaac Alpizar-Chacon)
9. Evaluation 1: Model extraction
Text extraction accuracy: our approach 93.85% vs. PDFBox 89.72% vs. PdfAct 84.19%
ToC recognition: Precision 99.92%, Recall 99.92%
Index recognition: Precision 98.56%, Recall 98.13%
Domains:
• Statistics (40 textbooks)
• Computer Science (5)
• History (5)
• Literature (5)
Rule-based process capturing common practices of textbook structuring and formatting (a sketch of the role-labeling step follows this slide):
1. Parsing of text fragments and formatting styles
2. Construction of the style library
3. Role labeling of fragments (regular text, important text, headings, subheadings)
4. Extraction of structural elements (ToC, Index, (sub)Sections, auxiliaries)
Alpizar-Chacon, I., & Sosnovsky, S. (2020). Order out of Chaos: Construction of Knowledge Models from PDF Textbooks. In Proceedings of DocEng'2020: The 20th ACM Symposium on Document Engineering (Article No. 8, pp. 1–10). New York, NY, USA: ACM Press.
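A minimal sketch of steps 2 and 3 of this process (style-library construction and role labeling), assuming text fragments have already been parsed from the PDF; the thresholds and heuristics are simplified assumptions, not the paper's exact rules:

```python
# Hypothetical sketch of steps 2–3 (style library + role labeling); assumes
# fragments were already parsed from the PDF. Thresholds are illustrative.
from collections import Counter
from dataclasses import dataclass

@dataclass
class Fragment:
    text: str
    font: str
    size: float
    bold: bool

def build_style_library(fragments):
    """Step 2: count how often each (font, size, bold) style occurs."""
    return Counter((f.font, f.size, f.bold) for f in fragments)

def label_roles(fragments):
    """Step 3: the most frequent style is regular body text; noticeably
    larger styles are (sub)headings; bold body-sized text is 'important'."""
    styles = build_style_library(fragments)
    (_, body_size, _), _ = styles.most_common(1)[0]
    labelled = []
    for f in fragments:
        if f.size >= body_size * 1.4:        # illustrative threshold
            role = "heading"
        elif f.size > body_size:
            role = "subheading"
        else:
            role = "important" if f.bold else "regular"
        labelled.append((f.text, role))
    return labelled
```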
10. Evaluation 2: Model Linking to DBpedia
• DBpedia is a global knowledge graph that provides our model with a frame of reference
1. Parsing the textbook index into a glossary
2. Linking the core set of glossary terms (unambiguous matching, high similarity of associated texts); a linking sketch follows this slide
3. Gradual expansion of the linked set through candidate resource disambiguation (using associated texts and links to already matched terms)
Question: Are the index terms linked to the right DBpedia resources?
• Ground truth was created manually
[Figure: linking results for three textbooks: Statistics#1, Statistics#2, Information Retrieval]
Alpizar-Chacon, I., & Sosnovsky, S. (2019). Expanding the Web of Knowledge: One Textbook at a Time. In Proceedings of ACM Hypertext'2019: 30th International Conference on Hypertext and Social Media (pp. 9–18). New York, NY, USA: ACM Press.
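A minimal sketch of the core linking step (step 2), assuming the public DBpedia Lookup service; the endpoint, response fields, and similarity threshold are assumptions, and the real disambiguation is more involved:

```python
# Hypothetical linking sketch: keep a glossary-term link only when exactly one
# DBpedia Lookup candidate matches the label and its abstract resembles the
# term's associated textbook text. Endpoint/fields assumed from the public API.
from difflib import SequenceMatcher
import requests

LOOKUP = "https://lookup.dbpedia.org/api/search"

def link_term(term, associated_text, threshold=0.4):
    docs = requests.get(
        LOOKUP, params={"query": term, "maxResults": 5},
        headers={"Accept": "application/json"},
    ).json().get("docs", [])

    def clean(label):  # Lookup labels may carry <B>…</B> highlighting
        return label.replace("<B>", "").replace("</B>", "").lower()

    exact = [d for d in docs
             if any(clean(l) == term.lower() for l in d.get("label", []))]
    if len(exact) != 1:           # ambiguous or no match: defer to later stages
        return None
    abstract = " ".join(exact[0].get("comment", []))
    # "High similarity of associated texts", proxied by a sequence ratio
    if SequenceMatcher(None, associated_text.lower(),
                       abstract.lower()).ratio() >= threshold:
        return exact[0]["resource"][0]
    return None
```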
11. Evaluation 3: Aggregation of Textbook Models
• Question: Would aggregation of additional textbooks move the model closer to a more complete/objective domain model (include more relevant resources)?
• Ground truth: constructed from the Glossary of statistical terms (> 1,000 terms)
• Task: compare the matching between textbooks and DBpedia with the "ideal" matching between the Glossary and DBpedia
[Figure: results for an average single textbook, an average of 5 textbooks, and all 10 textbooks]
Alpizar-Chacon, I., & Sosnovsky, S. (2021). Knowledge Models from PDF Textbooks. New Review of Hypermedia and Multimedia, 27(1), 1–49.
12. Evaluation 4: Domain Specificity
Term categories:
• core-domain: key terms
• in-domain: additional terms
• related-domain: terms in related domains
• out-of-domain: terms not related (included for pedagogical reasons)

Classification: in-domain+ | other-domain (Statistics), Accuracy: 92% ***
• in-domain+: Precision 97%, Recall 90%
• other-domain: Precision 83%, Recall 95%

Classification: in-domain+ | related-domain | out-of-domain (Statistics), Accuracy: 93% ***
• in-domain+: Precision 94%, Recall 98%
• related-domain: Precision 89%, Recall 94%
• out-of-domain: Precision 98%, Recall 88%

Classification: core-domain | in-domain | other-domain (Philosophy), Accuracy: 76% ***
• core-domain: Precision 90%, Recall 31%
• in-domain: Precision 70%, Recall 85%
• other-domain: Precision 83%, Recall 95%

Alpizar-Chacon, I., & Sosnovsky, S. (2022). What's in an Index: Extracting Domain-Specific Knowledge Graphs from Textbooks. In Proceedings of WWW '22: The ACM Web Conference 2022 (pp. 966–976). New York, NY, USA: ACM Press.
13. Evaluation 5: Concept quality
• Question: do the extracted models consist of cohesive, cognitively-valid domain concepts?
• Method: Learning Curves
• Power law of practice: the error rate while applying a skill/concept decreases as a power function of the number of attempts to apply it: ErrorRate = β · Attempt^(−α)
• External dataset in the domain of Python: students solving programming exercises
• 3 Python textbooks provided a joint model
• The model was used to (manually) annotate the exercises
• Learning curves were plotted for every concept (a fitting sketch follows this slide)
• Out of 46 used concepts, 41 resulted in learning curves with a positive slope (learning took place)
• The average R² (goodness of fit) for these 41 concepts is 0.65 (comparable to manually crafted domain models)
Alpizar-Chacon, I., Sosnovsky, S., & Brusilovsky, P. (2023). Measuring the Quality of Domain Models Extracted from Textbooks with Learning Curves Analysis. In Proceedings of AIED'2023: 24th International Conference on Artificial Intelligence in Education (Vol. 1, pp. 804–809). Berlin/Heidelberg, Germany: Springer.
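A sketch of the learning-curve fit, fitting ErrorRate = β · Attempt^(−α) by linear regression in log-log space; the error rates below are made up for illustration:

```python
# Sketch of fitting ErrorRate = beta * Attempt^(-alpha) per concept and
# computing R^2; the error rates below are made up for illustration.
import numpy as np

def fit_power_law(attempts, error_rates):
    """Linear regression in log-log space:
    log(err) = log(beta) - alpha * log(attempt)."""
    x, y = np.log(attempts), np.log(error_rates)
    slope, intercept = np.polyfit(x, y, 1)
    alpha, beta = -slope, np.exp(intercept)
    residuals = y - (slope * x + intercept)
    r2 = 1 - residuals.var() / y.var()        # R^2 of the log-log fit
    return alpha, beta, r2

attempts = np.array([1, 2, 3, 4, 5, 6])
errors = np.array([0.52, 0.35, 0.28, 0.22, 0.20, 0.17])   # illustrative
alpha, beta, r2 = fit_power_law(attempts, errors)
print(f"alpha={alpha:.2f}, beta={beta:.2f}, R2={r2:.2f}")  # alpha > 0 => learning
```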
14. Part 2: Using Textbook Models to Fine-tune LLMs for Domain-oriented AKE (Automatic Keyword Extraction)
15. Domain: Statistics
Corpus: 9 PDF textbooks:
• A Concise Guide to Statistics (Kaltenbach, 2007)
• A Modern Introduction to Probability and Statistics (Dekking et al., 2005)
• Modern Mathematical Statistics with Applications (Jay L. Devore, 2014)
• OpenIntro Statistics (Diez, Barr, and Cetinkaya-Rundel, 2012)
• Statistics and Probability Theory (Faber, 2012)
• Statistics for Non-Statisticians (Madsen, 2016)
• Probability and statistics for engineers and scientists (Walpole et al., 1993)
• Statistics for scientists and engineers (Shanmugam and Chattamvelli, 2015)
• Introductory statistics with R (Dalgaard, 2020)
17. IndexBERT
Architecture (input to output):
1. Sequence of words (e.g., "The arithmetic mean measures …")
2. BERT contextualised word embeddings
3. Bi-LSTM
4. Feed-forward layer (linear classifier)
Each token is classified as Key or NoKey (e.g., "mean" → Key).
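A PyTorch sketch of this architecture; the hidden size and base model name are assumptions, and the paper's exact hyperparameters may differ:

```python
# Hypothetical PyTorch sketch of IndexBERT: BERT embeddings -> Bi-LSTM ->
# feed-forward layer -> Key/NoKey logit per token. Sizes are assumptions.
import torch.nn as nn
from transformers import BertModel

class IndexBERT(nn.Module):
    def __init__(self, lstm_hidden=256, num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.bilstm = nn.LSTM(
            input_size=self.bert.config.hidden_size,
            hidden_size=lstm_hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)  # Key / NoKey

    def forward(self, input_ids, attention_mask):
        # (batch, seq_len, 768) contextualised word embeddings
        embeddings = self.bert(
            input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(embeddings)    # (batch, seq_len, 2*hidden)
        return self.classifier(lstm_out)         # per-token logits
```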
18. Experiment 1: Keyphrase Extraction
• Can IndexBERT outperform other approaches used for keyword/keyphrase extraction?
• Cross-validated:
  • 8 textbooks for fine-tuning IndexBERT
  • 1 textbook for testing
• Local Recall: looks only at the keyphrases present in the textbook currently used as a test set (a sketch of the metric follows below)
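A minimal sketch of how Local Recall could be computed; the lowercase substring matching is a simplifying assumption:

```python
# Hypothetical Local Recall: recall computed over only those gold keyphrases
# that actually occur in the text of the current test textbook.
def local_recall(predicted: set[str], gold: set[str], textbook_text: str) -> float:
    text = textbook_text.lower()
    local_gold = {kp.lower() for kp in gold if kp.lower() in text}
    if not local_gold:
        return 0.0
    hits = {kp.lower() for kp in predicted} & local_gold
    return len(hits) / len(local_gold)
```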
19. Experiment 2: From Lexical to Semantic Modeling
• Do BERT embeddings of domain-related and domain-unrelated key-phrases change differently after fine-tuning?
[Figure: embedding space before vs. after fine-tuning. Before fine-tuning, domain-related terms (gaussian distribution, normal distribution, covariance, test statistic) and unrelated terms (Space X, Napoleon III, penicillin) are intermixed; after fine-tuning, the domain terms cluster around STATISTICS while Space X, Napoleon III, and penicillin are pushed away.]
20. Experiment 2: From Lexical to Semantic Modeling
[Figure: similarity scores before vs. after training, measured between domain-related terms and between domain-related and out-of-domain terms.]
• We compare the similarity scores of in-domain and out-of-domain key-phrases (a sketch of the comparison follows below)
• Fine-tuning pulls in-domain key-phrases closer together
• Fine-tuning pushes out-of-domain key-phrases away
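A sketch of this comparison, assuming mean-pooled BERT embeddings and average pairwise cosine similarity; the fine-tuned model path is a placeholder:

```python
# Hypothetical embedding-shift analysis: average pairwise cosine similarity of
# key-phrase embeddings under the pre-trained vs. the fine-tuned model.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def embed(model, phrase):
    """Mean-pool the last hidden states of a key-phrase."""
    enc = tokenizer(phrase, return_tensors="pt")
    with torch.no_grad():
        return model(**enc).last_hidden_state.mean(dim=1).squeeze(0)

def mean_similarity(model, phrases_a, phrases_b):
    sims = [torch.cosine_similarity(embed(model, a), embed(model, b), dim=0).item()
            for a in phrases_a for b in phrases_b if a != b]
    return sum(sims) / len(sims)

in_domain = ["gaussian distribution", "normal distribution", "covariance"]
out_of_domain = ["Space X", "Napoleon III", "penicillin"]

for name in ["bert-base-uncased", "./indexbert-finetuned"]:  # placeholder path
    model = BertModel.from_pretrained(name).eval()
    print(name,
          "in-in:", round(mean_similarity(model, in_domain, in_domain), 3),
          "in-out:", round(mean_similarity(model, in_domain, out_of_domain), 3))
```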
21. Questions?
Model extraction service: https://intextbooks.science.uu.nl/
Model extraction code: https://github.com/isaacalpizar/IntelligentTextbooks
Editor's Notes
• Appels and Personal Reader
• Two graph-based baselines: TextRank and PageRank
• KeyBERT: pretrained BERT for keyword and keyphrase extraction