Lorenzo Pozzi, Isaac Alpizar-Chacon,
Sergey Sosnovsky
Harnessing Textbooks for High-Quality
Labelled Data: An Approach to Automatic
Keyword Extraction
Main idea
• Automated keyword extraction
• Supervised
• Unsupervised
• Deep learning based
• Modern deep learning methods based on transformer technology are quite effective
• A Large Language Model is pre-trained on a global corpus of general documents
• Then it is fine-tuned for domain- and context-specific tasks with a smaller corpus of labelled data
• Where to get these corpora of labelled data for model re-training?
Standard Dataset Annotation is not Feasible…
For three reasons:
• Long time of production
• Domain experts are well-paid
• Large amounts of data are needed
Can we use Textbooks Instead?
Textbooks are:
• Domain-oriented: focus on narrow and cohesive domains
• High-quality: written by experts
• Purposeful: written to explain the domain in detail and cover its important parts
Provided with labelled data, textbooks should be usable to fine-tune LLMs at scale
Part 1: Automated Extraction of Knowledge
Models from Textbooks
Textbooks as a source of (extractable) knowledge
• Focus (narrow, cohesive domain)
• Quality (created by domain experts)
• Purpose (content explains domain knowledge to a novice)
• Structure: sections / subsections
• Order: easy to complex
• Formatting: of content and headers
• Additional structural elements: indices, tables of contents
• Topics/subtopics: underlying content, textual labels
• Pedagogical relations: prerequisites <-> outcomes
• Text types/roles and relations: header vs important vs regular; same format = same role
• Meaningful labels: glossary of curated meaningful terms; set of important domain categories
• If automatically extracted and formally represented, these elements will form the model of the textbook and the model of the domain as the author understands it
Potential Problems of These Models
• Variability: structure, labels, order, focus, coverage
• Subjectivity: same domain + different authors = different textbooks => different models
• Quality: completeness, granularity, consistency
• Lack of semantics: more structure than knowledge; lack of links; cohesiveness of topics and index terms
Model Extraction from PDF Textbooks
• Extraction
• Linking & Enrichment
• Integration
• Domain validation
• Concept validation
• Formalisation
Isaac Alpizar Chacon
Evaluation 1: Model extraction
• Text extraction: our approach 93.85%; PDFBox 89.72%; PdfAct 84.19%
• TOC recognition: precision 99.92%; recall 99.92%
• Index recognition: precision 98.56%; recall 98.13%
Domains:
• Statistics (40 textbooks)
• Computer Science (5)
• History (5)
• Literature (5)
• Rule-based process capturing common practices of textbook structuring and formatting
1. Parsing of text fragments and formatting styles
2. Construction of the style library
3. Role labeling of fragments (regular text, important text, headings, subheadings)
4. Extraction of structural elements (ToC, Index, (sub)Sections, auxiliaries)
Alpizar-Chacon, I., & Sosnovsky, S. (2020). Order out of Chaos: Construction of Knowledge Models from PDF Textbooks. In Proceedings
of DocEng’2020: The 20th ACM Symposium on Document Engineering, (Article No.: 8, pp 1–10). New York, NY, USA: ACM Press.
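Steps 2 and 3 of the rule-based process above can be illustrated with a small sketch. It is not the authors' implementation: the `Fragment` fields, the "most frequent style is body text" rule and the size-based heading rule are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A parsed text fragment with its formatting (hypothetical fields)."""
    text: str
    font_size: float
    bold: bool


def build_style_library(fragments):
    """Count how many characters are set in each (font_size, bold) style."""
    library = Counter()
    for f in fragments:
        library[(f.font_size, f.bold)] += len(f.text)
    return library


def label_roles(fragments):
    """Assign a role to every fragment from its style alone."""
    library = build_style_library(fragments)
    # Assumption: the style covering the most characters is regular body text.
    (body_size, _), _ = library.most_common(1)[0]
    # Assumption: any larger font size marks a heading; the largest is top level.
    heading_sizes = sorted({size for size, _ in library if size > body_size},
                           reverse=True)
    roles = []
    for f in fragments:
        if f.font_size > body_size:
            role = "heading" if f.font_size == heading_sizes[0] else "subheading"
        elif f.bold:
            role = "important"
        else:
            role = "regular"
        roles.append((f.text, role))
    return roles


fragments = [
    Fragment("1 Descriptive Statistics", 18.0, True),
    Fragment("1.1 Measures of Location", 14.0, True),
    Fragment("The arithmetic mean is the sum of the observations "
             "divided by their number.", 10.0, False),
    Fragment("arithmetic mean", 10.0, True),
]
roles = label_roles(fragments)
for text, role in roles:
    print(f"{role:<11} {text}")
```

The point of the sketch is the "same format = same role" idea: roles come from the style library, not from the words themselves.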
Evaluation 2: Model Linking to DBpedia
• DBpedia is a global knowledge graph that provides our model with a frame of reference
1. Parsing textbook index into a glossary
2. Linking the core set of glossary terms (unambiguous matching, high similarity of associated texts)
3. Gradual expansion of the linked set through candidate resource disambiguation (using associated texts and links to already matched terms)
[Chart: linking results for Statistics#1, Statistics#2, Information Retrieval]
Alpizar-Chacon, I., & Sosnovsky, S. (2019). Expanding the Web of Knowledge: One Textbook at a Time. In Proceedings of ACM
Hypertext'2019: 30th International Conference on Hypertext and Social Media (pp. 9-18). New York, NY, USA: ACM Press.
Question: Are the index terms linked to the right DBpedia resources?
• Ground truth was created manually
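The "high similarity of associated texts" criterion from step 2 can be sketched as follows. This is a simplified stand-in, not the published method: it uses word-overlap (Jaccard) similarity, a made-up threshold, and hypothetical candidate resources with hypothetical abstracts.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two texts (0.0 to 1.0)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def link_term(context: str, candidates: dict, threshold: float = 0.2):
    """Return the candidate whose abstract best matches the term's context.

    `candidates` maps a resource name to its abstract text. Returns None
    when no candidate reaches the (assumed) similarity threshold.
    """
    if not candidates:
        return None
    best = max(candidates, key=lambda name: jaccard(context, candidates[name]))
    return best if jaccard(context, candidates[best]) >= threshold else None


# Hypothetical candidates for the index term "mean".
candidates = {
    "Arithmetic_mean": "the arithmetic mean is the sum of a collection of "
                       "numbers divided by the count of numbers",
    "Mean_(song)": "mean is a song by an american singer",
}
context = "the arithmetic mean is the sum of values divided by the number of values"
linked = link_term(context, candidates)
print(linked)
```

Step 3 would then reuse terms that are already linked as extra evidence when disambiguating the remaining ones; that part is not shown here.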
Evaluation 3: Aggregation of Textbook Models
• Question: Would aggregation of additional textbooks move the model closer to a more complete/objective domain model (include more relevant resources)?
• Ground truth: constructed based on the Glossary of statistical terms
• > 1000 terms
• Task: compare the matching between textbooks and DBpedia with the “ideal” matching between the Glossary and DBpedia
[Chart: average single textbook, average 5 textbooks, 10 textbooks]
Alpizar-Chacon, I., & Sosnovsky, S.(2021). Knowledge models from PDF textbooks. New Review of Hypermedia and Multimedia, 27(1), (1-49).
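The comparison in this evaluation can be pictured with a few set operations. The sketch below is only an illustration: the resource names are invented, and the measure shown (share of glossary-linked resources covered by the merged textbooks) is an assumed simplification of the actual evaluation.

```python
def coverage(textbook_links, glossary_links):
    """Share of glossary-linked resources found in the merged textbook models.

    `textbook_links` is a list of sets, one per textbook, holding the
    resources each textbook's index terms were linked to.
    """
    merged = set().union(*textbook_links)
    return len(merged & glossary_links) / len(glossary_links)


# Hypothetical linked resources.
glossary = {"Mean", "Median", "Variance", "Covariance"}
book_1 = {"Mean", "Median"}
book_2 = {"Mean", "Variance", "Python"}

single = coverage([book_1], glossary)
aggregated = coverage([book_1, book_2], glossary)
print(single, aggregated)
```

Adding a textbook can only keep or increase this coverage, which is why the interesting question is how quickly it grows and how many irrelevant resources come along with it.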
Evaluation 4: Domain Specificity
• core-domain: key terms
• in-domain: additional terms
• related-domain: terms in related domains
• out-of-domain: terms not related (pedagogical reasons)

in-domain+ | other-domain
                  Precision   Recall   Accuracy
in-domain+        97%         90%      92% ***
other-domain      83%         95%

in-domain+ | related-domain | out-of-domain
                  Precision   Recall   Accuracy
in-domain+        94%         98%      93% ***
related-domain    89%         94%
out-of-domain     98%         88%

core-domain | in-domain | other-domain
                  Precision   Recall   Accuracy
core-domain       90%         31%      76% ***
in-domain         70%         85%
other-domain      83%         95%

[Chart labels: Statistics, Statistics, Philosophy]
Alpizar-Chacon, I., & Sosnovsky, S. (2022). What's in an Index: Extracting Domain-Specific Knowledge
Graphs from Textbooks. In Proceedings of WWW '22: ACM Web Conference 2022 (pp. 966–976). New
York, NY, USA: ACM Press.
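For readers less familiar with the layout of the tables above: precision and recall are computed per class, while accuracy is a single number for the whole classification task. A minimal sketch with made-up labels (not the study's data):

```python
def precision_recall(gold, predicted, label):
    """Precision and recall for one class."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def accuracy(gold, predicted):
    """Share of terms assigned to the correct class, over all classes."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Hypothetical gold and predicted classes for six index terms.
gold = ["in", "in", "in", "other", "other", "in"]
pred = ["in", "in", "other", "other", "in", "in"]

p_in, r_in = precision_recall(gold, pred, "in")
acc = accuracy(gold, pred)
print(p_in, r_in, acc)
```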
Evaluation 5: Concept quality
• Question: do the extracted models consist of cohesive, cognitively-valid domain concepts?
• Method: Learning Curves
• Power law of practice: The error rate while applying a skill/concept decreases as a power function of the number of attempts to apply it: $\mathit{ErrorRate} = \beta \cdot \mathit{Attempt}^{-\alpha}$
• External dataset in the domain of Python: students solving programming exercises
• 3 Python textbooks provided a joint model
• The model was used to (manually) annotate the exercises
• Learning curves were plotted for every concept
• Out of 46 used concepts, 41 resulted in learning curves with a positive slope (learning took place)
• The average R² (fitting coefficient) for these 41 concepts is 0.65 (comparable to manually crafted domain models)
Alpizar-Chacon, I., Sosnovsky, S., & Brusilovsky, P. (2023). Measuring the Quality of Domain Models Extracted from Textbooks with Learning Curves Analysis.
In Proceedings of AIED'2023: 24th International Conference on Artificial Intelligence in Education (Vol. 1, pp. 804-809). Berlin/Heidelberg, Germany: Springer.
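A power-law learning curve can be fitted with ordinary least squares after taking logarithms, since log(ErrorRate) = log(β) − α · log(Attempt). The sketch below assumes this log-log fit and reports R² in log space; the fitting procedure used in the study may differ, and the data are synthetic.

```python
import math


def fit_power_law(error_rates):
    """Fit ErrorRate = beta * Attempt**(-alpha) by log-log least squares.

    error_rates[i] is the error rate at attempt i + 1; all values must be > 0.
    Returns (alpha, beta, r_squared), with r_squared measured in log space.
    """
    xs = [math.log(i + 1) for i in range(len(error_rates))]
    ys = [math.log(e) for e in error_rates]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot if ss_tot else 1.0
    return -slope, math.exp(intercept), r_squared


# Synthetic error rates that follow the power law exactly.
rates = [0.6 * attempt ** -0.5 for attempt in range(1, 7)]
alpha, beta, r2 = fit_power_law(rates)
print(alpha, beta, r2)
```

In this parameterisation a concept shows learning when the fitted α is greater than zero, i.e. the error rate falls as attempts accumulate.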
Part 2: Using Textbooks Models to Fine-tune
LLMs for Domain-oriented AKE
Domain: Statistics
Corpus - 9 PDF textbooks:
• A Concise Guide to Statistics (Kaltenbach, 2007)
• A Modern Introduction to Probability and Statistics (Dekking et al., 2005)
• Modern Mathematical Statistics with Applications (Jay L. Devore, 2014)
• OpenIntro statistics (Diez, Barr, and Cetinkaya-Rundel, 2012)
• Statistics and Probability Theory (Faber, 2012)
• Statistics for Non-Statisticians (Madsen, 2016)
• Probability and statistics for engineers and scientists (Walpole et al., 1993)
• Statistics for scientists and engineers (Shanmugam and Chattamvelli, 2015)
• Introductory statistics with R (Dalgaard, 2020)
Dataset Construction
One row in the dataset
IndexBERT
1. Sequence of words (e.g. "The arithmetic mean measures …")
2. BERT: contextualised word embeddings
3. Bi-LSTM
4. Feed-forward layer (linear classifier): labels each word as Key or NoKey
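The Key/NoKey labels that such a sequence tagger is trained on can be derived from a textbook's index. The sketch below is an assumed, simplified version of that labelling: it uses lowercase whitespace tokens and exact phrase matching, whereas a BERT-based pipeline would work on subword tokens. The function name and example are illustrative only.

```python
def label_tokens(sentence, index_terms):
    """Label each word Key if it is part of an index term, else NoKey."""
    tokens = sentence.lower().split()
    labels = ["NoKey"] * len(tokens)
    for term in index_terms:
        words = term.lower().split()
        n = len(words)
        # Slide a window over the sentence and mark every exact match.
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                labels[i:i + n] = ["Key"] * n
    return list(zip(tokens, labels))


row = label_tokens("The arithmetic mean measures central tendency",
                   ["arithmetic mean"])
print(row)
```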
Experiment 1: Keyphrase Extraction
• Can IndexBERT outperform other approaches used for keyword/keyphrase extraction?
• Cross-validated
• 8 textbooks - for fine-tuning IndexBERT
• 1 textbook - testing
• Local Recall - looks only at the keyphrases present in the textbook currently used as a test set
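The evaluation protocol above (train on 8 textbooks, test on the remaining one, rotate) and the Local Recall measure can be sketched as follows. The textbook identifiers and keyphrases are placeholders, and the recall definition is an assumed reading of the slide.

```python
def leave_one_textbook_out(textbooks):
    """Yield (training_books, test_book) pairs, one fold per textbook."""
    for i, test_book in enumerate(textbooks):
        yield textbooks[:i] + textbooks[i + 1:], test_book


def local_recall(predicted, gold_in_test_book):
    """Recall against only the keyphrases present in the current test book."""
    if not gold_in_test_book:
        return 0.0
    return len(predicted & gold_in_test_book) / len(gold_in_test_book)


textbooks = [f"book{i}" for i in range(1, 10)]  # placeholder identifiers
folds = list(leave_one_textbook_out(textbooks))

# Hypothetical predictions and gold keyphrases for one test textbook.
score = local_recall({"mean", "variance", "table"},
                     {"mean", "variance", "median", "mode"})
print(len(folds), score)
```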
Experiment 2: From Lexical to Semantic Modeling
[Figure: key-phrase embeddings before and after fine-tuning. Example key-phrases: gaussian distribution, normal distribution, covariance, test statistic, Space X, Napoleon III, penicillin. After fine-tuning, a STATISTICS group is shown apart from Space X, penicillin and Napoleon III.]
• Do BERT embeddings of domain-related and domain-unrelated key-phrases change differently after fine-tuning?
Experiment 2: From Lexical to Semantic Modeling
[Figure: similarity scores before and after training, between domain-related terms and between domain-related and out-of-domain terms]
• We compare the similarity scores of in-domain and out-of-domain key-phrases
• Fine-tuning pulls in in-domain key-phrases
• Fine-tuning pushes out out-of-domain key-phrases
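The comparison can be sketched with cosine similarity over key-phrase embeddings. The slide does not name the similarity measure, so cosine is an assumption here, and the two-dimensional vectors are invented toy values rather than real BERT embeddings.

```python
import math
from itertools import combinations, product


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mean_within(vectors):
    """Average similarity over all pairs inside one group."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def mean_between(group_a, group_b):
    """Average similarity over all pairs across two groups."""
    pairs = list(product(group_a, group_b))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy embeddings (made up): two in-domain phrases and one out-of-domain phrase.
before_in, before_out = [(1.0, 0.0), (0.0, 1.0)], [(1.0, 1.0)]
after_in, after_out = [(1.0, 0.0), (1.0, 0.1)], [(-1.0, 0.0)]

pulled_in = mean_within(after_in) > mean_within(before_in)
pushed_out = mean_between(after_in, after_out) < mean_between(before_in, before_out)
print(pulled_in, pushed_out)
```

With real embeddings the same two averages, computed before and after fine-tuning, give the four values contrasted in the figure.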
Questions?
Model extraction service: https://intextbooks.science.uu.nl/
Model extraction code: https://github.com/isaacalpizar/IntelligentTextbooks

 
Heredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsHeredity: Inheritance and Variation of Traits
Heredity: Inheritance and Variation of TraitsCharlene Llagas
 

Harnessing Textbooks for High-Quality Labeled Data: An Approach to Automatic Keyword Extraction

  • 1. Lorenzo Pozzi, Isaac Alpizar-Chacon, Sergey Sosnovsky Harnessing Textbooks for High-Quality Labelled Data: An Approach to Automatic Keyword Extraction
  • 2. Main idea • Automated keyword extraction • Supervised • Unsupervised • Deep learning based • Modern deep learning methods based on transformer technology are quite effective • A Large Language Model is pre-trained on a global corpus of general documents • Then it is fine-tuned for domain- and context-specific tasks with a smaller corpus of labelled data • Where to get these corpora of labelled data for model re-training?
  • 3. Standard Dataset Annotation is not Feasible… For three reasons: • Long production time • Domain experts are well-paid • Large amounts of data are needed
  • 4. Can we use Textbooks Instead? Textbooks are: • Domain-oriented: focus on narrow and cohesive domains • High-quality: written by experts • Purposeful: written to explain the domain in detail and cover its important parts • Provided with labelled data, textbooks should be usable to fine-tune LLMs at scale
  • 5. Part 1: Automated Extraction of Knowledge Models from Textbooks
  • 6. Textbooks as a source of (extractable) knowledge • Focus (narrow, cohesive domain) • Quality (created by domain experts) • Purpose (content explains domain knowledge to a novice) • Structure: sections / subsections • Order: easy to complex • Formatting: of content and headers • Additional structural elements: indices, tables of contents • Topics/subtopics: underlying content, textual labels • Pedagogical relations: prerequisites <-> outcomes • Text types/roles and relations: header vs important vs regular; same format = same role • Meaningful labels: glossary of curated meaningful terms, set of important domain categories • If automatically extracted and formally represented, these elements will form the model of the textbook and the model of the domain as the author understands it
  • 7. Potential Problems of These Models • Variability: structure, labels, order, focus, coverage • Subjectivity: same domain + different authors = different textbooks => different models • Quality: completeness, granularity, consistency • Lack of semantics: more structure than knowledge; lack of links; cohesiveness of topics and index terms
  • 8. Model Extraction from PDF Textbooks (pipeline diagram; stages shown: Extraction, Linking & Enrichment, Integration, Domain validation, Concept validation, Formalisation; credit: Isaac Alpizar Chacon)
  • 9. Evaluation 1: Model extraction • Rule-based process capturing common practices of textbook structuring and formatting: 1. Parsing of text fragments and formatting styles 2. Construction of the style library 3. Role labeling of fragments (regular text, important text, headings, subheadings) 4. Extraction of structural elements (ToC, Index, (sub)Sections, auxiliaries) • Text extraction: our approach 93.85%, PDFBox 89.72%, PdfAct 84.19% • TOC recognition: precision 99.92%, recall 99.92% • Index recognition: precision 98.56%, recall 98.13% • Domains: Statistics (40 textbooks), Computer Science (5), History (5), Literature (5) Alpizar-Chacon, I., & Sosnovsky, S. (2020). Order out of Chaos: Construction of Knowledge Models from PDF Textbooks. In Proceedings of DocEng’2020: The 20th ACM Symposium on Document Engineering, (Article No.: 8, pp 1–10). New York, NY, USA: ACM Press.
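Steps 2 and 3 on slide 9 (style library, role labeling) can be illustrated with a rough heuristic. The sketch below is not the authors' rule set: the fragment format `(text, font_size, is_bold)`, the function name `label_roles`, and the rule "the most-used style is regular text, larger sizes are headings" are all assumptions made for illustration.

```python
from collections import Counter

def label_roles(fragments):
    """Assign a role to each (text, font_size, is_bold) fragment.

    Hypothetical heuristic: the style covering the most characters is
    regular text, larger fonts are (sub)headings, and a different weight
    at body size marks important text.
    """
    # "Style library": number of characters written in each style.
    usage = Counter()
    for text, size, bold in fragments:
        usage[(size, bold)] += len(text)
    body_size, body_bold = usage.most_common(1)[0][0]

    # Font sizes above the body size, largest first.
    heading_sizes = sorted({s for s, _ in usage if s > body_size}, reverse=True)

    labelled = []
    for text, size, bold in fragments:
        if size > body_size:
            role = "heading" if size == heading_sizes[0] else "subheading"
        elif size == body_size and bold != body_bold:
            role = "important text"
        else:
            role = "regular text"
        labelled.append((text, role))
    return labelled

# Made-up fragments, as a PDF parser might return them.
fragments = [
    ("Chapter 1 Probability", 18, True),
    ("1.1 Events", 14, True),
    ("An event is a subset of the sample space and is assigned a probability.", 10, False),
    ("sample space", 10, True),
]
roles = [role for _, role in label_roles(fragments)]
```

A real textbook needs many more signals (font family, position on the page, numbering patterns), so this only shows the shape of the idea that the same format implies the same role.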
  • 10. Evaluation 2: Model Linking to DBpedia • DBpedia is a global knowledge graph that provides our model with a frame of reference 1. Parsing textbook index into a glossary 2. Linking the core set of glossary terms (unambiguous matching, high similarity of associated texts) 3. Gradual expansion of the linked set through candidate resource disambiguation (using associated texts and links to already matched terms) • Question: Are the index terms linked to the right DBpedia resources? • Ground truth was created manually • Chart panels (not reproduced): Statistics#1, Statistics#2, Information Retrieval Alpizar-Chacon, I., & Sosnovsky, S. (2019). Expanding the Web of Knowledge: One Textbook at a Time. In Proceedings of ACM Hypertext'2019: 30th International Conference on Hypertext and Social Media (pp. 9-18). New York, NY, USA: ACM Press.
  • 11. Evaluation 3: Aggregation of Textbook Models • Question: Would aggregation of additional textbooks move the model closer to a more complete/objective domain model (include more relevant resources)? • Ground truth: constructed based on the Glossary of statistical terms • > 1000 terms • Task: compare the matching between textbooks and DBpedia with the “ideal” matching between the Glossary and DBpedia • Chart series (not reproduced): average single textbook, average of 5 textbooks, 10 textbooks Alpizar-Chacon, I., & Sosnovsky, S. (2021). Knowledge models from PDF textbooks. New Review of Hypermedia and Multimedia, 27(1), 1-49.
  • 12. Evaluation 4: Domain Specificity • core-domain: key terms • in-domain: additional terms • related-domain: terms in related domains • out-of-domain: terms not related (pedagogical reasons) • Classification in-domain+ | other-domain: accuracy 92% ***; in-domain+ precision 97%, recall 90%; other-domain precision 83%, recall 95% • Classification in-domain+ | related-domain | out-of-domain: accuracy 93% ***; in-domain+ precision 94%, recall 98%; related-domain precision 89%, recall 94%; out-of-domain precision 98%, recall 88% • Classification core-domain | in-domain | other-domain: accuracy 76% ***; core-domain precision 90%, recall 31%; in-domain precision 70%, recall 85%; other-domain precision 83%, recall 95% • Domain labels shown on the slide: Statistics, Statistics, Philosophy Alpizar-Chacon, I., & Sosnovsky, S. (2022). What's in an Index: Extracting Domain-Specific Knowledge Graphs from Textbooks. In Proceedings of WWW '22: ACM Web Conference 2022 (pp. 966–976). New York, NY, USA: ACM Press.
  • 13. Evaluation 5: Concept quality • Question: do the extracted models consist of cohesive, cognitively-valid domain concepts? • Method: Learning Curves • Power law of practice: the error rate while applying a skill/concept decreases as a power function of the number of attempts to apply it: $\mathit{ErrorRate} = \beta \cdot \mathit{Attempt}^{-\alpha}$ • External dataset in the domain of Python: students solving programming exercises • 3 Python textbooks provided a joint model • The model was used to (manually) annotate the exercises • Learning curves were plotted for every concept • Out of 46 used concepts, 41 resulted in learning curves with a positive slope (learning took place) • The average $R^2$ (fitting coefficient) for these 41 concepts is 0.65 (comparable to manually crafted domain models) Alpizar-Chacon, I., Sosnovsky, S., & Brusilovsky, P. (2023). Measuring the Quality of Domain Models Extracted from Textbooks with Learning Curves Analysis. In Proceedings of AIED'2023: 24th International Conference on Artificial Intelligence in Education (Vol. 1, pp. 804-809). Berlin/Heidelberg, Germany: Springer.
  • 14. Part 2: Using Textbook Models to Fine-tune LLMs for Domain-oriented AKE
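The power law of practice on slide 13 becomes a straight line after taking logarithms, log(ErrorRate) = log(β) − α·log(Attempt), so one way to fit a learning curve is ordinary least squares in log-log space. The sketch below assumes that fitting method and computes R² on the log-transformed values; the slides do not say how the authors fitted their curves, and the function name and the data are made up.

```python
import math

def fit_power_law(error_rates):
    """Fit ErrorRate = beta * Attempt**(-alpha) by least squares in log-log space.

    error_rates[i] is the observed error rate at attempt i + 1.
    Returns (alpha, beta, r_squared), with r_squared measured in log space.
    """
    # Zero error rates cannot be log-transformed, so they are skipped.
    points = [(math.log(attempt), math.log(rate))
              for attempt, rate in enumerate(error_rates, start=1) if rate > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for _, y in points)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return -slope, math.exp(intercept), r_squared

# Synthetic error rates generated from beta = 0.5, alpha = 0.4 (not real data).
observed = [0.5 * attempt ** -0.4 for attempt in range(1, 9)]
alpha, beta, r_squared = fit_power_law(observed)
```

In this parameterisation a positive α means the error rate falls as practice accumulates, and R² indicates how closely the observations follow the power-law shape.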
  • 15. Domain: Statistics Corpus - 9 PDF textbooks: • A Concise Guide to Statistics (Kaltenbach, 2007) • A Modern Introduction to Probability and Statistics (Dekking et al., 2005) • Modern Mathematical Statistics with Applications (Jay L. Devore, 2014) • OpenIntro statistics (Diez, Barr, and Cetinkaya-Rundel, 2012) • Statistics and Probability Theory (Faber, 2012) • Statistics for Non-Statisticians (Madsen, 2016) • Probability and statistics for engineers and scientists (Walpole et al., 1993) • Statistics for scientists and engineers (Shanmugam and Chattamvelli, 2015) • Introductory statistics with R (Dalgaard, 2020)
  • 17. IndexBERT (architecture diagram; components shown: sequence of words, BERT contextualised word embeddings, Bi-LSTM, feed-forward layer / linear classifier; each word of the example input "The arithmetic mean measures …" is tagged Key or NoKey)
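The Key / NoKey tags on slide 17 imply word-level training labels. A minimal sketch of how such labels could be derived from a textbook index follows, assuming simple case-insensitive exact matching of index terms against the token sequence; the transcript does not describe the actual labelling procedure, and `label_tokens` is a hypothetical helper.

```python
def label_tokens(tokens, index_terms):
    """Tag each token "Key" if it lies inside a match of any index term, else "NoKey"."""
    lowered = [token.lower() for token in tokens]
    labels = ["NoKey"] * len(tokens)
    for term in index_terms:
        words = term.lower().split()
        if not words:
            continue
        # Slide a window of the term's length over the sentence.
        for start in range(len(lowered) - len(words) + 1):
            if lowered[start:start + len(words)] == words:
                for position in range(start, start + len(words)):
                    labels[position] = "Key"
    return labels

# Made-up sentence and index terms.
sentence = "The arithmetic mean measures central tendency".split()
labels = label_tokens(sentence, ["arithmetic mean", "central tendency"])
```

Exact matching misses inflected or reordered variants such as "means" or "tendency, central", so a practical pipeline would likely add normalisation before matching.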
  • 18. Experiment 1: Keyphrase Extraction • Can IndexBERT outperform other approaches used for keyword/keyphrase extraction? • Cross-validated • 8 textbooks - for fine-tuning IndexBERT • 1 textbook – testing • Local Recall – looks only at the keyphrases present in the textbook currently used as a test set
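Local Recall, as described on slide 18, restricts the gold standard to keyphrases that occur in the held-out textbook. The sketch below is one reading of that definition, assuming case-insensitive substring matching decides whether a gold keyphrase is "present"; the function name and the example data are illustrative.

```python
def local_recall(predicted, gold_keyphrases, test_text):
    """Recall over the gold keyphrases that literally occur in the test textbook."""
    text = test_text.lower()
    # Keep only gold keyphrases that appear in the test text.
    local_gold = {phrase.lower() for phrase in gold_keyphrases if phrase.lower() in text}
    if not local_gold:
        return 0.0
    hits = local_gold & {phrase.lower() for phrase in predicted}
    return len(hits) / len(local_gold)

# Made-up example: "Markov chain" is a gold keyphrase elsewhere in the corpus,
# but it does not occur in this test text, so it is not counted against the model.
text = "The sample mean and the standard deviation summarise the data."
gold = ["sample mean", "standard deviation", "Markov chain"]
predicted = ["sample mean", "data"]
score = local_recall(predicted, gold, text)
```

Here the model finds one of the two locally present gold keyphrases, so the score is 0.5, whereas recall over the full gold list would be lower.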
  • 19. Experiment 2: From Lexical to Semantic Modeling • Do BERT embeddings of domain-related and domain-unrelated key-phrases change differently after fine-tuning? (Illustration, before vs. after fine-tuning; example phrases: gaussian distribution, normal distribution, covariance, test statistic, Space X, Napoleon III, penicillin; after fine-tuning Space X, penicillin and Napoleon III are shown apart from the STATISTICS group)
  • 20. Experiment 2: From Lexical to Semantic Modeling • We compare the similarity scores of in-domain and out-of-domain key-phrases, before and after training: between domain-related terms, and between domain-related and out-of-domain terms • Fine-tuning pulls in in-domain key-phrases • Fine-tuning pushes out out-of-domain key-phrases
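The comparison on slide 20 can be expressed as two averages: the similarity among in-domain phrase embeddings and the similarity between in-domain and out-of-domain embeddings. The sketch below assumes cosine similarity as the score and uses toy 2-dimensional vectors in place of real BERT embeddings; both choices are illustrative assumptions.

```python
import math
from itertools import combinations, product

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mean_similarity(pairs):
    """Average cosine similarity over an iterable of vector pairs."""
    scores = [cosine(u, v) for u, v in pairs]
    return sum(scores) / len(scores)

def domain_separation(in_domain, out_of_domain):
    """Return (mean similarity within the domain, mean similarity across domains)."""
    within = mean_similarity(combinations(in_domain, 2))
    across = mean_similarity(product(in_domain, out_of_domain))
    return within, across

# Toy vectors standing in for embeddings of two statistics terms and one unrelated term.
in_domain = [[1.0, 0.0], [0.9, 0.1]]
out_of_domain = [[0.0, 1.0]]
within, across = domain_separation(in_domain, out_of_domain)
```

Running the same computation on embeddings taken before and after fine-tuning would show the reported effect as `within` rising and `across` falling.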
  • 21. Questions? Model extraction service: https://intextbooks.science.uu.nl/ Model extraction code: https://github.com/isaacalpizar/IntelligentTextbooks

Editor's Notes

  1. Appels and Personal Reader
  2. 2 graph-based: TextRank and PageRank. Key-BERT: pretrained BERT for keyword and keyphrase extraction