Isaac Alpizar-Chacon and Sergey Sosnovsky
Utrecht University
Utrecht, The Netherlands
Order out of Chaos: Construction of Knowledge
Models from PDF Textbooks
2
Motivation
• Textbooks are high-quality textual resources
• Textbooks are non-structured resources
• The Table of Contents provides a browsing aid
• The Index provides a searching aid
• Authors use their understanding of the domain while creating textbooks
• Formatting and structuring conventions provide meaningful information
Goal
The automated extraction of
machine-readable textbook models
3
Q1: can knowledge be automatically
extracted from textbooks?
Q2: what would be the quality and the
value of such models?
4
Rule-based workflow
PDF as the most common and challenging format
4 stages, 9 steps, 39 rules
5
Rule-based workflow
6
Example Rule
• REPEATED_LINES:
1. Create a sample of pages: $P_s = \{p_a, p_b, \ldots, p_m\}$, with $P_s \subset P$.
2. If the first line(s) are identical across $P_s$: a header is detected and removed in all pages $p \in P$.
3. If the last line(s) are identical across $P_s$: a footer is detected and removed in all pages $p \in P$.
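The three steps of REPEATED_LINES can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the random sampling strategy, the default sample size, and the representation of a page as a list of text lines are all assumptions made for the example.

```python
import random


def shared_prefix_len(line_lists):
    """Return how many leading lines are identical in every list."""
    count = 0
    for group in zip(*line_lists):
        if len(set(group)) != 1:
            break
        count += 1
    return count


def strip_repeated_lines(pages, sample_size=5, seed=0):
    """Remove header/footer lines that repeat across a sample of pages.

    pages: list of pages, each page a list of text lines (assumed format).
    """
    # Step 1: draw the sample P_s from P (random sampling is an assumption).
    rng = random.Random(seed)
    sample = rng.sample(pages, min(sample_size, len(pages)))
    if len(sample) < 2:
        # A single page cannot show repetition; leave the pages untouched.
        return [list(p) for p in pages]

    # Step 2: identical first line(s) across the sample form the header.
    header = shared_prefix_len(sample)
    # Step 3: identical last line(s) across the sample form the footer.
    footer = shared_prefix_len([p[::-1] for p in sample])

    # Remove the detected header and footer from every page p in P.
    return [p[header:max(header, len(p) - footer)] for p in pages]
```

Note that this sketch only catches lines that are exactly identical, as in the rule above; running headers that include a changing page number would need an additional rule.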
7
Elements identified in TOC and Index sections
8
Textbook model
• Structure (sections)
• Content (words, lines, etc.)
• Domain knowledge (terms)
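One way to picture the three parts of the textbook model is as a small set of data classes. The class and field names below are hypothetical and chosen only for illustration; the slides do not specify the actual schema of the extracted models.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    """Structure: one chapter or sub-chapter (hypothetical fields)."""
    title: str
    first_page: int
    last_page: int
    # Content: the text lines that belong to this section.
    lines: List[str] = field(default_factory=list)
    subsections: List["Section"] = field(default_factory=list)


@dataclass
class IndexTerm:
    """Domain knowledge: one term taken from the index section."""
    label: str
    pages: List[int] = field(default_factory=list)


@dataclass
class TextbookModel:
    """A textbook as structure, content, and domain knowledge."""
    title: str
    sections: List[Section] = field(default_factory=list)
    terms: List[IndexTerm] = field(default_factory=list)
```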
9
Accuracy of the extraction of the models
Domains: Statistics, Computer Science, History, Literature
10
Accuracy of the extraction of the models: Results
Averages over all domains
• Text extraction: Our approach 93.85%, PDFBox 89.72%, PdfAct 84.19%
• TOC recognition: Precision 99.92%, Recall 99.92%
• Index recognition: Precision 98.56%, Recall 98.13%
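For reference, precision and recall over a set of recognized items can be computed as in the sketch below. The slides do not state at which granularity the TOC and Index elements were compared, so the set-based comparison and the example terms here are assumptions for illustration only.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted items against a gold standard."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: share of extracted items that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold items that were extracted.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: four gold index terms, three extracted, two of them correct.
gold_terms = ["mean", "median", "variance", "venn diagram"]
extracted_terms = ["mean", "median", "page"]
precision, recall = precision_recall(extracted_terms, gold_terms)
```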
11
Application of the textbook models
[Figure: chapter and sub-chapter hierarchies of Book#1 (Chap1 to Chap3) and Book#2 (Chap1 to Chap4), shown twice]
12
Application of the textbook models
• Linking model:
  • A term-based Vector Space Model (VSM) with 1611 terms from two books
  • VSM applied to all chapters and sub-chapters of both books
• Measure:
  • NDCG (normalized discounted cumulative gain) at 1, 3, and 5
• Baselines:
  • TFIDF model
  • LDA model
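A rough sketch of how such a linking experiment could be set up: sections are represented as vectors over a fixed term vocabulary, candidate sections from the other book are ranked by cosine similarity, and the ranking is scored with NDCG@k. The raw term counts, the whitespace tokenization (single-word terms only), the linear-gain DCG variant, and all function names are assumptions; the slides do not specify these details.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_sections(query_vector, candidates):
    """Sort candidate section names by decreasing similarity to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vector, candidates[name]),
        reverse=True,
    )


def ndcg_at_k(ranked_relevances, k):
    """NDCG@k for relevance grades listed in ranked order.

    The ideal ranking is taken from the same list, which assumes that all
    relevant candidates appear somewhere in it.
    """
    def dcg(relevances):
        return sum(r / math.log2(i + 2) for i, r in enumerate(relevances[:k]))

    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0
```

In use, each chapter or sub-chapter of one book would serve as the query, and the relevance grades for the returned ranking would come from a manually built ground truth.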
13
Application of the textbook models: Results
[Bar chart: NDCG@1, NDCG@3, and NDCG@5 (axis from 0 to 0.9) for TFIDF, LDA, TFIDF+LDA, and our model]
14
Summary
• Our rule-based approach allows the automated extraction of knowledge models (Q1)
• Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy (Q2)
• The linking of sections across textbooks within the same domain demonstrates the added value of the extracted models (Q2)
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
15
Related work
• We have integrated individual textbooks within the same domain with each other and with the Linked Open Data cloud using DBpedia
[Figure labels: Mean, Venn Diagram, …]
• Our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources
16
Future work
• We plan to use the information in both the Table of Contents and the Index
more extensively:
• Each chapter/subchapter can potentially be treated as a topic/subtopic
annotated with terms in the domain thanks to the explicit connections
between the terms in the index section and the different content sections
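As a rough illustration of this idea, a section could be annotated with every index term that has at least one page reference inside the section's page range. The data layout (TOC entries as title and page range, index entries as term and page list), the function name, and the example pages are all hypothetical; this is a sketch of the planned direction, not an existing feature.

```python
def annotate_sections(toc, index):
    """Attach index terms to the sections whose page range they fall into.

    toc: list of (title, first_page, last_page) tuples (assumed format).
    index: dict mapping a term to the list of pages it is indexed on.
    """
    annotations = {title: [] for title, _, _ in toc}
    for term, pages in index.items():
        for title, first_page, last_page in toc:
            if any(first_page <= page <= last_page for page in pages):
                annotations[title].append(term)
    return annotations


# Made-up example: a chapter, one of its sub-chapters, and a second chapter.
toc = [
    ("1 Descriptive statistics", 1, 20),
    ("1.1 Central tendency", 3, 9),
    ("2 Probability", 21, 40),
]
index = {"mean": [4, 25], "Venn diagram": [22]}
topics = annotate_sections(toc, index)
```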
Thank you!
https://github.com/intextbooks/ITCore
https://intextbooks.science.uu.nl
Contact:
Isaac Alpizar-Chacon <i.alpizarchacon@uu.nl>

Order out of Chaos: Construction of Knowledge Models from PDF Textbooks

  • 1. Isaac Alpizar-Chacon and Sergey Sosnovsky Utrecht University Utrecht, The Netherlands Order out of Chaos: Construction of Knowledge Models from PDF Textbooks
  • 2. 2 Motivation Textbooks are high-quality textual resources Textbooks are non- structured resources Table of Content provides browsing aid Index provides searching aid Authors use their understanding of the domain while creating textbooks Formatting and structuring conventions provide meaningful information
  • 3. Goal The automated extraction of machine-readable textbook models 3 Q1: can knowledge be automatically extracted from textbooks? Q2: what would be the quality and the value of such models?
  • 4. 4 Rule-based workflow PDF as the most common and challenging format 4 stages 9 steps 39 rules
  • 6. 6 Example Rule • REPEATED_LINES: 1. Create a sample of pages: 𝑃𝑠 = {𝑝𝑎 , 𝑝𝑏 , . . . , 𝑝𝑚 } | 𝑃𝑠 ⊂ 𝑃. 2. If the first line(s) are identical across 𝑃𝑠 : header is detected and removed in all pages 𝑝 ∈ 𝑃. 3. If the last line(s) are identical across 𝑃𝑠 : footer is detected and removed in all pages 𝑝 ∈ 𝑃.
  • 7. 7 Elements identified in TOC and Index sections
  • 9. 9 Accuracy of the extraction of the models Domains: Statistics, Computer Science, History, Literature
  • 10. 10 Accuracy of the extraction of the models: Results Averages over all domains Text Extraction Our approach: 93.85% PDFBox: 89.72% PdfAct: 84.19% TOC Recognition Precision: 99.92% Recall: 99.92% Index Recognition Precision: 98.56% Recall: 98.13%
  • 11. 11 Application of the textbook models Book#1 Chap1 Sub1 Sub2 Chap2 Chap3 Book#2 Chap1 Sub1 Sub2 Sub3 Chap2 Chap3 Sub1 Sub2 Chap4 Book#1 Chap1 Sub1 Sub2 Chap2 Chap3 Book#2 Chap1 Sub1 Sub2 Sub3 Chap2 Chap3 Sub1 Sub2 Chap4
  • 12. 12 Application of the textbook models • Linking model: • A term-based Vector Space Model (VSM) with 1611 terms from two books • VSM applied to all chapters and sub-chapters of the both books • Measure: • NDCG (normalized discounted cumulative gain) at 1, 3, and 5. • Baselines: • TFIDF model • LDA model
  • 13. 13 Application of the textbook models: Results 0 0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8 0,9 NDCG@1 NDCG@3 NDCG@5 TFIDF LDA TFIDF+LDA Our model
  • 14. 14 Summary • Our rule-based approach allows the automated extraction of knowledge models (Q1) • Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy (Q2) • The linking of section across textbooks within the same domain demonstrates the added value of the extracted models (Q2) Q1: can knowledge be automatically extracted from textbooks? Q2: what would be the quality and the value of such models?
  • 15. Related work • We have integrated individual textbooks within the same domain with each other and with the Linked Open Data cloud using DBpedia [Figure: example terms Mean, Venn Diagram, …] • Our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources
  • 16. Future work • We plan to use the information in both the Table of Contents and the Index more extensively: • Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain thanks to the explicit connections between the terms in the index section and the different content sections
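
The REPEATED_LINES rule from slide 6 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes each page is already available as a list of text lines, that the sample is a contiguous slice of pages, and that "identical" means exact string equality. The names `count_repeated`, `remove_repeated_lines`, `sample_start`, and `sample_size` are hypothetical.

```python
from typing import List

Page = List[str]  # assumption: a page is represented as a list of text lines


def count_repeated(sample: List[Page], from_top: bool) -> int:
    """Count how many leading (or trailing) lines are identical across the sample."""
    if not sample:
        return 0
    shortest = min(len(page) for page in sample)
    count = 0
    for offset in range(shortest):
        idx = offset if from_top else -1 - offset
        # Stop at the first position where the sampled pages disagree.
        if len({page[idx] for page in sample}) != 1:
            break
        count += 1
    return count


def remove_repeated_lines(pages: List[Page], sample_start: int, sample_size: int) -> List[Page]:
    """Detect header/footer lines on a page sample and strip them from all pages."""
    sample = pages[sample_start:sample_start + sample_size]
    if len(sample) < 2:
        # Nothing to compare against, so leave the pages unchanged.
        return [list(page) for page in pages]
    header = count_repeated(sample, from_top=True)
    footer = count_repeated(sample, from_top=False)
    cleaned = []
    for page in pages:
        end = max(header, len(page) - footer)  # guard against very short pages
        cleaned.append(page[header:end])
    return cleaned
```

Exact matching will miss headers that change from page to page (for example, running page numbers), which is presumably why the workflow relies on further rules for such elements.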

Editor's Notes

  1. (pause: 2) Hello and welcome to this presentation. My name is Isaac, I am a PhD student at Utrecht University and I will be describing our work: (pause: 1) Order out of Chaos: construction of knowledge models from PDF textbooks.
  2. (pause: 2) I will start by saying that textbooks are high-quality textual resources, but they are often considered to be non-structured. But, if we look carefully at how textbooks are made, they provide a lot of information. The Table of Contents provides browsing aid, and the index provides searching aid and terms in the domain. The authors use their understanding of the domain while creating textbooks, and we use these formatting and structuring conventions to extract meaningful information.
  3. (pause: 2) Our goal is to achieve the automated extraction of machine-readable textbook models. This goal involves two research questions: (pause: 1) First, can knowledge be automatically extracted from textbooks? And second, what would be the quality and the value of such models? Our work seeks to answer these questions.
  4. (pause: 2) We developed a rule-based approach for the extraction of the knowledge models. We focus on PDF as the most common and challenging digital textbook format. Our workflow has 4 stages, 9 steps, and 39 rules. (pause: 1) The modular nature of the rule-based approach supports its gradual refinement. Each time we encounter a new variation of a formatting or structural pattern, we extend the approach by modifying an existing rule or adding a new one.
  5. (pause: 2) In the diagram we can see the complete workflow. The first stage is the text extraction to reconstruct all the words, lines, and pages from the PDF. In the second stage, the workflow assigns role labels, such as section heading, subheading, important text, and body text, to each text fragment. This process facilitates the subsequent recognition of different logical elements of the textbook. The third large stage of the workflow is to recognize all different logical elements within a textbook. First, auxiliary elements such as page numbers and headers are filtered out. Then, the individual entries of the table of contents are recognized and processed. Later, each index term is identified. Finally, individual sections are recognized. In the final stage we construct the textbook model, which can be later enriched with external information.
  6. (pause: 2) To give you one example of what the rules look like, we have the _repeated lines_ rule, which is used to detect general page headers and footers. This rule is part of the auxiliary elements filtering step. (pause: 1) First, we create a sample of continuous pages from all the pages in the textbook. Then, if the first lines in each page of the sample are the same, a header is detected and removed in all the pages from the textbook. Footers are detected in a similar way but comparing the last lines in the pages from the sample.
  7. (pause: 2) The rules are used to identify different elements in the textbooks. In the table of contents, we use them to detect the pages that belong to the TOC, non-content sections like notation or preface, chapter and subchapter entries, entries that are split across multiple lines, and to identify one of three possible types of TOCs: flat, flat-ordered, or indented. (pause: 1) For the index sections, the rules identify the pages that belong to the section, the heading and page references of the terms, multiline terms, different types of terms like cross-references, and nested groups of terms.
  8. (pause: 2) At the end of the workflow we construct a textbook model using the Text Encoding Initiative, which is a standard for digital representation of texts. In the model we group the information in 3 categories: structure, content, and domain knowledge. (pause: 1) The structure section contains the name and precise start and end page of each chapter and subchapter of the textbook. The content includes the textual information structured as words, lines, fragments, and pages for each chapter and subchapter. Finally, the domain knowledge contains all the important terms in the domain extracted from the index section.
  9. (pause: 2) To test the accuracy of the extraction of the models, we extracted the models using our rule-based approach and using the epub version of the same textbooks. In the epub textbooks the information is already structured and marked, so it is easy to extract and it is accurate. We hypothesize that if the information obtained from the two versions of a textbook matches, that means the approach processes PDF correctly. (pause: 1) We used textbooks from 4 different domains: Statistics, Computer Science, History, and Literature.
  10. (pause: 2) Results from this first evaluation show that our approach has high accuracy. (pause: 1) For the text extraction aspect, we also compared our approach against 2 other tools as baselines. Our approach achieved the highest similarity, followed by PDFBox and then PdfAct. We don’t reach 100 percent similarity mostly because of formulae, charts, and tables that are images in the epub but text in the PDF version. An additional effect of the rules that improve textual extraction, along with the rules for recognition of page is a cleaner textual version of the textbook, as seen when our approach is compared against the out-of-the-box PDFBox tool that lacks these features. (pause: 1) For the recognition of the individual entries in the Table of Content, we reach a precision and recall of almost 100%. (pause: 1) Precision and recall are also very high for the recognition of the index terms.
  11. (pause: 2) We also study one of the possible knowledge-driven applications of the extracted models: we used models of two textbooks to cross-link relevant sections. The idea is that any chapter or subchapter from the first textbook can be linked to any chapter or subchapter of the second textbook to identify similar sections.
  12. (pause: 2) We constructed a linking model using a term-based Vector Space Model (VSM) with one thousand six hundred eleven terms from the two books. Then, the VSM was applied to all chapters and sub-chapters of both books. The sections have been annotated by the terms according to the knowledge models extracted from the textbooks’ indices. The inner product of these annotations has been used to compute similarity between all sections of book 1 and all sections of book 2. We used the normalized discounted cumulative gain to measure the quality of the ranking of documents by relevance. NDCG@1 measures the effectiveness of retrieving the most relevant document, while @3 and @5 measure the capability of the retrieval system to find the first three and five most relevant documents, respectively. We also used a manual linking produced by experts as the ground truth for the NDCG measures. Finally, we used two baselines for comparison: the standard TFIDF model and an LDA model. Both baselines have used the textual content of each part of the textbooks with basic preprocessing (lowercase, stop-words, and stemming).
  13. (pause: 2) The results show that the proposed model consistently outperforms all baselines, as seen with the yellow bar in the graph. (pause: 2) The difference between our model and the baselines is the highest for NDCG@1. The semantic information placed by the authors of textbooks in the index sections and extracted by our approach helps our linking model find 72% of the best possible matches between the textbook sections. As the number of potential matches increases, the difference between NDCG scores diminishes due to the ceiling effect. (pause: 2)
  14. (pause: 2) In summary, we developed a rule-based approach that allows the automated extraction of knowledge models. This answers our first research question. Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy. And the linking of sections across textbooks within the same domain demonstrates the added value of the extracted models. The two evaluation experiments answer our second research question. (pause: 2)
  15. (pause: 2) Related to this work, we have taken individual textbooks within the same domain and integrated them with each other and with the Linked Open Data cloud using DBpedia. For example, individual terms like mean and Venn diagram are linked to their corresponding resources in DBpedia. (pause: 2) Also, our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources. (pause: 2)
  16. (pause: 2) As future work, we plan to use the information in both the Table of Contents and the Index more extensively: Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain thanks to the explicit connections between the terms in the index section and the different content sections.
  17. (pause: 2) Finally, I invite you to check out our GitHub project, and to use our web service to create textbook models. Thank you for your attention! (pause: 2)
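
The linking experiment described in note 12 (sections annotated with index terms, inner-product similarity, NDCG against expert links) can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the authors' code: term weights are assumed to be raw counts, DCG uses the linear-gain form with a log2 discount, and the vocabulary, section names, and relevance grades in the demo are invented for illustration. All function names are hypothetical.

```python
import math
from typing import Dict, List


def term_vector(section_terms: List[str], vocabulary: List[str]) -> List[int]:
    """Represent a section by how often each vocabulary term annotates it."""
    return [section_terms.count(term) for term in vocabulary]


def inner_product(u: List[int], v: List[int]) -> int:
    return sum(a * b for a, b in zip(u, v))


def rank_sections(query: List[int], candidates: Dict[str, List[int]]) -> List[str]:
    """Order candidate sections by decreasing inner product with the query section."""
    return sorted(candidates, key=lambda name: inner_product(query, candidates[name]), reverse=True)


def dcg(gains: List[float], k: int) -> float:
    # Assumption: linear gain with log2 position discount (positions start at 1).
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked: List[str], relevance: Dict[str, float], k: int) -> float:
    """NDCG@k of a ranking, given expert relevance grades (missing means 0)."""
    gains = [relevance.get(name, 0.0) for name in ranked]
    ideal_dcg = dcg(sorted(relevance.values(), reverse=True), k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # Invented toy data: four index terms, one section of book 1, three of book 2.
    vocabulary = ["mean", "median", "variance", "venn diagram"]
    query = term_vector(["mean", "median"], vocabulary)
    candidates = {
        "B2-1.1": term_vector(["mean", "median", "variance"], vocabulary),
        "B2-2.1": term_vector(["venn diagram"], vocabulary),
        "B2-1.2": term_vector(["variance", "mean"], vocabulary),
    }
    expert_relevance = {"B2-1.1": 2.0, "B2-1.2": 1.0}  # hypothetical expert grades
    ranking = rank_sections(query, candidates)
    for k in (1, 3, 5):
        print(f"NDCG@{k} = {ndcg_at_k(ranking, expert_relevance, k):.3f}")
```

Averaging `ndcg_at_k` over every section of the first book would give one score per cutoff, comparable in spirit to the bars on slide 13.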