Textbooks are educational documents created, structured, and formatted by domain experts with the main purpose of explaining the knowledge in a domain to a novice. Authors use their understanding of the domain when structuring and formatting the content of a textbook to facilitate this explanation. As a result, the formatting and structural elements of textbooks implicitly encode elements of their authors' domain knowledge. Our paper presents an extensible approach to the automated extraction of this knowledge from textbooks, taking into account their formatting rules and internal structure. We focus on PDF as the most common textbook representation format; however, the overall method is applicable to other formats as well. The evaluation experiments examine the accuracy of the approach, as well as the pragmatic quality of the obtained knowledge models, using one of their possible applications: semantic linking of textbooks in the same domain. The results indicate high accuracy of model construction on the symbolic, syntactic, and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.
Presented at Document Engineering 2020
Order out of Chaos: Construction of Knowledge Models from PDF Textbooks
1. Order out of Chaos: Construction of Knowledge Models from PDF Textbooks
Isaac Alpizar-Chacon and Sergey Sosnovsky
Utrecht University
Utrecht, The Netherlands
2. Motivation
• Textbooks are high-quality textual resources
• Textbooks are non-structured resources
• The Table of Contents provides a browsing aid
• The Index provides a searching aid
• Authors use their understanding of the domain while creating textbooks
• Formatting and structuring conventions provide meaningful information
3. Goal
The automated extraction of machine-readable textbook models
• Q1: Can knowledge be automatically extracted from textbooks?
• Q2: What would be the quality and the value of such models?
6. Example Rule
• REPEATED_LINES (a minimal code sketch follows below):
1. Create a sample of pages: P_s = {p_a, p_b, ..., p_m}, where P_s ⊂ P.
2. If the first line(s) are identical across P_s, a header is detected and removed in all pages p ∈ P.
3. If the last line(s) are identical across P_s, a footer is detected and removed in all pages p ∈ P.
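To make the rule concrete, here is a minimal sketch of REPEATED_LINES in Python. The Page structure, the random sampling, and the single-line comparison are our simplifying assumptions for illustration; the actual workflow operates on richer PDF line objects and may compare several leading and trailing lines.

```python
# Minimal sketch of the REPEATED_LINES rule, under the assumptions stated above.
import random
from dataclasses import dataclass, field

@dataclass
class Page:
    lines: list[str] = field(default_factory=list)  # text lines, top to bottom

def repeated_lines(pages: list[Page], sample_size: int = 5) -> None:
    """Detect headers/footers repeated across a page sample; remove them everywhere."""
    if len(pages) < 2:
        return
    sample = random.sample(pages, min(sample_size, len(pages)))  # P_s ⊂ P

    def identical(select) -> bool:
        reference = select(sample[0].lines)
        return reference is not None and all(
            select(p.lines) == reference for p in sample[1:])

    # Step 2: an identical first line across the sample marks a header.
    if identical(lambda lines: lines[0] if lines else None):
        for page in pages:
            if page.lines:
                page.lines.pop(0)

    # Step 3: an identical last line across the sample marks a footer.
    if identical(lambda lines: lines[-1] if lines else None):
        for page in pages:
            if page.lines:
                page.lines.pop()
```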
9. Accuracy of the extraction of the models
Domains: Statistics, Computer Science, History, Literature
10. Accuracy of the extraction of the models: Results
Averages over all domains:
• Text extraction: our approach 93.85%, PDFBox 89.72%, PdfAct 84.19%
• TOC recognition: precision 99.92%, recall 99.92%
• Index recognition: precision 98.56%, recall 98.13%
12. Application of the textbook models
• Linking model (a hedged code sketch follows below):
  • A term-based Vector Space Model (VSM) with 1611 terms from two books
  • The VSM applied to all chapters and sub-chapters of both books
• Measure: NDCG (normalized discounted cumulative gain) at 1, 3, and 5
• Baselines: a TFIDF model and an LDA model
Application of the textbook models: Results
[Bar chart: NDCG@1, NDCG@3, and NDCG@5 scores (0–0.9) for TFIDF, LDA, TFIDF+LDA, and our model]
Summary
• Our rule-based approach allows the automated extraction of knowledge models (Q1)
• Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy (Q2)
• The linking of sections across textbooks within the same domain demonstrates the added value of the extracted models (Q2)
Q1: can knowledge be automatically extracted from textbooks?
Q2: what would be the quality and the value of such models?
Related work
• We have integrated individual textbooks within the same domain with each other and with the Linked Open Data cloud using DBpedia (e.g., terms such as "mean" and "Venn diagram" are linked to their DBpedia resources)
• Our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources
Future work
• We plan to use the information in both the Table of Contents and the Index more extensively:
  • Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain, thanks to the explicit connections between the terms in the index section and the different content sections
(pause: 2)
Hello and welcome to this presentation. My name is Isaac, I am a PhD student at Utrecht University and I will be describing our work:
(pause: 1)
Order out of Chaos: construction of knowledge models from PDF textbooks.
(pause: 2)
I will start by saying that textbooks are high-quality textual resources, but they are often considered to be non-structured. However, if we look carefully at how textbooks are made, they provide a lot of information. The Table of Contents provides a browsing aid, and the index provides a searching aid and terms in the domain. The authors use their understanding of the domain while creating textbooks, and we use these formatting and structuring conventions to extract meaningful information.
(pause: 2)
Our goal is to achieve the automated extraction of machine-readable textbook models. This goal involves two research questions:
(pause: 1)
First, can knowledge be automatically extracted from textbooks? And second, what would be the quality and the value of such models?
Our work seeks to answer these questions.
(pause: 2)
We developed a rule-based approach for the extraction of the knowledge models. We focus on PDF as the most common and challenging digital textbook format. Our workflow has 4 stages, 9 steps, and 39 rules.
(pause: 1)
The modular nature of the rule-based approach supports its gradual refinement. Each time we encounter a new variation of a formatting or structural pattern, we extend the approach by modifying an existing rule or adding a new one.
(pause: 2)
In the diagram we can see the complete workflow. The first stage is text extraction, which reconstructs all the words, lines, and pages from the PDF. In the second stage, the workflow assigns role labels, such as section heading, subheading, important text, and body text, to each text fragment. This facilitates the subsequent recognition of the different logical elements of the textbook. The third stage of the workflow recognizes all the different logical elements within a textbook. First, auxiliary elements such as page numbers and headers are filtered out. Then, the individual entries of the table of contents are recognized and processed. Later, each index term is identified. Finally, individual sections are recognized. In the final stage, we construct the textbook model, which can later be enriched with external information.
(pause: 2)
To give you one example of what the rules look like, consider the _repeated lines_ rule, which is used to detect general page headers and footers. This rule is part of the auxiliary elements filtering step.
(pause: 1)
First, we create a sample of consecutive pages from all the pages in the textbook. Then, if the first lines in each page of the sample are the same, a header is detected and removed from all the pages of the textbook. Footers are detected in a similar way, but by comparing the last lines of the pages in the sample.
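To make this concrete, here is a minimal sketch of such a repeated-lines check in Python. It assumes pages arrive as lists of text lines and takes the sample from the front of the book, which simplifies the sampling described in the rule:

```python
# A minimal sketch of the REPEATED_LINES rule, assuming each page is a
# list of text lines and taking the sample P_s from the front of the book.
def detect_repeated_lines(pages, sample_size=5):
    """Return (header, footer): a line shared by all sampled pages, or None."""
    sample = [p for p in pages[:sample_size] if p]  # P_s, a subset of P
    first_lines = {p[0] for p in sample}
    last_lines = {p[-1] for p in sample}
    header = first_lines.pop() if len(first_lines) == 1 else None
    footer = last_lines.pop() if len(last_lines) == 1 else None
    return header, footer

def strip_repeated_lines(pages, sample_size=5):
    """Remove a detected header/footer from every page p in P."""
    header, footer = detect_repeated_lines(pages, sample_size)
    cleaned = []
    for lines in pages:
        lines = list(lines)
        if header and lines and lines[0] == header:
            lines = lines[1:]
        if footer and lines and lines[-1] == footer:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned

pages = [["Statistics in Practice", "Chapter 1 ...", "Page 1"],
         ["Statistics in Practice", "More content ...", "Page 2"]]
print(strip_repeated_lines(pages))  # header removed; differing footers kept
```

Note that the differing page numbers in the example are deliberately left alone here; in the workflow they are handled by a separate auxiliary-element rule.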
(pause: 2)
The rules are used to identify different elements in the textbooks. In the table of contents, we use them to detect the pages that belong to the ToC, non-content sections like the notation or preface, chapter and subchapter entries, entries that are split across multiple lines, and to identify one of three possible types of ToCs: flat, flat-ordered, or indented.
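As a rough illustration of how such layout cues could be tested, the sketch below separates the three ToC types using two simple heuristics, indentation and section numbering. These heuristics are our own reading of the type names, not the actual rules of the approach:

```python
# Two illustrative heuristics (indentation, section numbering) to tell
# the three ToC layouts apart; not the approach's actual rules.
import re

def classify_toc(entries):
    """entries: raw ToC lines with leading whitespace preserved."""
    indented = any(line != line.lstrip() for line in entries)
    numbered = all(re.match(r"\s*\d+(\.\d+)*\s", line) for line in entries)
    if indented:
        return "indented"      # hierarchy signaled by indentation
    if numbered:
        return "flat-ordered"  # hierarchy signaled by section numbers
    return "flat"              # no explicit hierarchy markers

print(classify_toc(["1 Introduction 1", "1.1 Motivation 2", "2 Methods 9"]))
# -> flat-ordered
```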
(pause: 1)
For the index sections, the rules identify the pages that belong to the section, the heading and page references of the terms, multiline terms, different types of terms like cross-references, and nested groups of terms.
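The following sketch shows one plausible way to parse such index entries. The entry formats (comma-separated page lists, "see ..." cross-references) are typical of back-of-book indices, but the actual rules in the approach are more elaborate and also handle multiline and nested entries:

```python
# An illustrative index-entry parser; real index layouts need more rules.
import re

def parse_index_entry(line):
    cross = re.match(r"(?P<term>.+?),\s*see\s+(?P<target>.+)", line)
    if cross:
        return {"term": cross["term"], "see": cross["target"]}
    m = re.match(r"(?P<term>.+?),\s*(?P<pages>[\d,\s–-]+)$", line)
    if m:
        pages = [p.strip() for p in m["pages"].split(",") if p.strip()]
        return {"term": m["term"], "pages": pages}
    return {"term": line.strip(), "pages": []}

print(parse_index_entry("variance, 45, 102"))        # page references
print(parse_index_entry("dispersion, see variance")) # cross-reference
```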
(pause: 2)
At the end of the workflow we construct a textbook model using the Text Encoding Initiative, which is a standard for digital representation of texts. In the model we group the information in 3 categories: structure, content, and domain knowledge.
(pause: 1)
The structure section contains the name and precise start and end page of each chapter and subchapter of the textbook. The content includes the textual information structured as words, lines, fragments, and pages for each chapter and subchapter. Finally, the domain knowledge contains all the important terms in the domain extracted from the index section.
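As an illustration of what such a TEI-based model could look like, the sketch below builds a heavily simplified TEI tree in Python. The real schema records far more detail (words, lines, fragments, page boundaries) and places these elements under the proper TEI header and body structures:

```python
# A heavily simplified, illustrative TEI-style model built in Python;
# the actual schema used by the approach is richer.
import xml.etree.ElementTree as ET

tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")

# Domain knowledge: important terms extracted from the index section.
keywords = ET.SubElement(ET.SubElement(tei, "teiHeader"), "keywords")
for term in ["mean", "variance", "Venn diagram"]:
    ET.SubElement(keywords, "term").text = term

# Structure and content: one chapter with its heading and text.
body = ET.SubElement(ET.SubElement(tei, "text"), "body")
chapter = ET.SubElement(body, "div", type="chapter")
ET.SubElement(chapter, "head").text = "1 Descriptive Statistics"
ET.SubElement(chapter, "p").text = "Chapter content as words, lines, pages ..."

print(ET.tostring(tei, encoding="unicode"))
```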
(pause: 2)
To test the accuracy of the extraction of the models, we extracted the models using our rule-based approach and from the epub versions of the same textbooks. In the epub textbooks, the information is already structured and marked up, so it is easy to extract and is accurate. We hypothesize that if the information obtained from the two versions of a textbook matches, the approach processes PDFs correctly.
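One simple way to quantify the agreement between the two extracted versions is a character-level similarity ratio, sketched below; the exact similarity measure used in the evaluation may differ:

```python
# A character-level similarity ratio between PDF- and epub-derived text;
# the evaluation's actual measure may differ.
from difflib import SequenceMatcher

pdf_text = "The mean is the average of a set of values."
epub_text = "The mean is the average of a set of values ."

similarity = SequenceMatcher(None, pdf_text, epub_text).ratio()
print(f"{similarity:.2%}")  # close to 100% for nearly identical extractions
```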
(pause: 1)
We used textbooks from 4 different domains: Statistics, Computer Science, History, and Literature.
(pause: 2)
Results from this first evaluation show that our approach has high accuracy.
(pause: 1)
For the text extraction aspect, we also compared our approach against 2 other tools as baselines. Our approach achieved the highest similarity, followed by PDFBox and then PdfAct. We don't reach 100 percent similarity mostly because of formulae, charts, and tables that are images in the epub but text in the PDF version. An additional effect of the rules that improve text extraction, along with the rules for recognizing page elements such as headers and footers, is a cleaner textual version of the textbook, as seen when our approach is compared against the out-of-the-box PDFBox tool, which lacks these features.
(pause: 1)
For the recognition of the individual entries in the Table of Contents, we reach a precision and recall of almost 100%.
(pause: 1)
Precision and recall are also very high for the recognition of the index terms.
(pause: 2)
We also studied one of the possible knowledge-driven applications of the extracted models: we used the models of two textbooks to cross-link relevant sections. The idea is that any chapter or subchapter from the first textbook can be linked to any chapter or subchapter of the second textbook to identify similar sections.
(pause: 2)
We constructed a linking model using a term-based Vector Space Model (VSM) with one thousand six hundred eleven terms from the two books. Then, the VSM was applied to all chapters and sub-chapters of both books. The sections have been annotated with terms according to the knowledge models extracted from the textbooks' indices. The inner product of these annotations has been used to compute the similarity between all sections of book 1 and all sections of book 2.
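A minimal sketch of this term-based VSM step follows, assuming each section is annotated with terms from a shared vocabulary; the four-term vocabulary stands in for the 1611 shared terms used in the evaluation:

```python
# A minimal sketch of the term-based VSM linking step.
import numpy as np

vocabulary = ["mean", "variance", "regression", "sampling"]

def term_vector(section_terms):
    # Count-based annotation vector over the shared term vocabulary.
    return np.array([section_terms.count(t) for t in vocabulary], dtype=float)

book1_section = ["mean", "variance", "variance"]  # terms annotating a section
book2_section = ["variance", "sampling"]

v1, v2 = term_vector(book1_section), term_vector(book2_section)
print(float(v1 @ v2))  # inner product as the similarity score -> 2.0
```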
We used the normalized discounted cumulative gain to measure the quality of the documents ranked by relevance. NDCG@1 measures the effectiveness of retrieving the most relevant document, while NDCG@3 and NDCG@5 measure the capability of the retrieval system to find the first three and five most relevant documents, respectively. We also used a manual linking produced by experts as the ground truth for the NDCG measures.
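NDCG@k itself is a standard measure; a small self-contained implementation looks like this, where the relevance scores come from the expert ground-truth linking:

```python
# A standard NDCG@k implementation matching the measure described above.
import math

def dcg_at_k(relevances, k):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance of the sections retrieved for one query section, in ranked order.
print(ndcg_at_k([3, 1, 2, 0, 1], k=3))  # ~0.97
```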
Finally, we used two baselines for comparison: the standard TFIDF model and an LDA model. Both baselines used the textual content of each part of the textbooks with basic preprocessing (lowercasing, stop-word removal, and stemming).
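A sketch of how such a TFIDF baseline with this preprocessing could be assembled follows; the library choices and sample sections are ours, not the authors':

```python
# An illustrative TFIDF baseline with lowercasing, stop-word removal,
# and stemming; library choices are ours.
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def tokenize(text):
    # lowercase, strip punctuation, drop stop words, then stem
    tokens = (t.strip(".,;:!?") for t in text.lower().split())
    return [stemmer.stem(t) for t in tokens if t and t not in ENGLISH_STOP_WORDS]

sections_book1 = ["The mean summarizes central tendency.",
                  "Sampling distributions vary from sample to sample."]
sections_book2 = ["Central tendency and the mean.",
                  "The basics of hypothesis testing."]

vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
tfidf = vectorizer.fit_transform(sections_book1 + sections_book2)

# Similarity of every book-1 section to every book-2 section.
print(cosine_similarity(tfidf[:2], tfidf[2:]))
```

An LDA baseline could be built analogously, e.g. with gensim's LdaModel, inferring a topic distribution per section and comparing those distributions instead of TFIDF vectors.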
(pause: 2)
The results show that the proposed model consistently outperforms all baselines, as seen with the yellow bar in the graph.
(pause: 2)
The difference between our model and the baselines is the highest for NDCG@1.
The semantic information placed by the authors of textbooks in the index sections and extracted by our approach helps our linking model find 72% of the best possible matches between the textbook sections. As the number of potential matches increases, the difference between the NDCG scores diminishes due to the ceiling effect.
(pause: 2)
(pause: 2)
In summary, we developed a rule-based approach that allows the automated extraction of knowledge models. This answers our first research question.
Our first evaluation experiment shows that the approach is capable of processing PDF textbooks with high accuracy.
And the linking of sections across textbooks within the same domain demonstrates the added value of the extracted models.
The two evaluation experiments answer our second research question.
(pause: 2)
(pause: 2)
Related to this work, we have taken individual textbooks within the same domain and integrated them with each other and with the Linked Open Data cloud using DBpedia. For example, individual terms like "mean" and "Venn diagram" are linked to their corresponding resources in DBpedia.
(pause: 2)
Also, our rule-based approach is the foundation for Intextbooks: a system capable of transforming PDF textbooks into intelligent educational resources.
(pause: 2)
(pause: 2)
As future work, we plan to use the information in both the Table of Contents and the Index more extensively:
Each chapter/subchapter can potentially be treated as a topic/subtopic annotated with terms in the domain thanks to the explicit connections between the terms in the index section and the different content sections.
(pause: 2)
Finally, I invite you to check out our GitHub project and to use our web service to create textbook models.
Thank you for your attention!
(pause: 2)