Table of contents (TOC) detection has drawn attention in recent years because it plays an important role in the digitization of multi-page documents. A book is typically a multi-page document, so detecting its table of contents page becomes necessary for easy navigation and for faster retrieval of the desired information from the document. Table of contents pages follow different layouts and different ways of presenting the contents of the document, such as chapters, sections, and subsections. This paper introduces a new method to detect the table of contents using machine learning techniques with different features. The main aim of detecting table of contents pages is to structure the document according to its contents.
A statistical model for gist generation: a case study on Hindi news articles (IJDKP)
Every day, a huge number of news articles are reported and disseminated on the Internet. By generating the gist of an article, a reader can go through its main topics instead of reading the whole article, which would take much more time. An ideal system would understand the document and generate the appropriate theme(s) directly from the results of that understanding. In the absence of a natural language understanding system, an appropriate alternative must be designed. Gist generation is a difficult task because it requires both maximizing the text content covered by a short summary and maintaining the grammaticality of the text. In this paper we present a statistical approach to generating the gist of a Hindi news article. The experimental results are evaluated using standard measures such as precision, recall, and the F1 measure, for different statistical models and their combinations, on articles both before and after pre-processing.
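As an aside, the standard measures cited above can be computed over sets of extracted units as follows. This is an illustrative sketch, not the paper's evaluation code, and the sentence identifiers are hypothetical:

```python
def precision_recall_f1(system, reference):
    """Compute precision, recall, and F1 between a system's extracted
    units (e.g. summary sentences) and the reference (gold) units."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 2 of 3 selected sentences appear in the reference gist.
p, r, f = precision_recall_f1({"s1", "s2", "s3"}, {"s1", "s2", "s4", "s5"})
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.5 0.571
```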
The enormous amount of information stored in unstructured texts cannot simply be used for further processing by computers, which typically handle text as simple sequences of character strings. Therefore, specific (pre-)processing methods and algorithms are required in order to extract useful patterns. Text mining is the discovery of valuable, yet hidden, information in text documents. Text classification (also called text categorization) is one of the important research issues in the field of text mining: it is often necessary to classify large collections of texts (documents) into specific classes, and text classification assigns a text document to one of a set of predefined classes. This paper covers different text classification techniques and also discusses classifier architecture and applications of text classification.
Summary of Research on Text Information and Enterprise Delisting (Yogesh, IJTSRD)
Limited by research methods, text-information research long consisted of a small number of studies with small samples. It was not until computer technology was applied to the field that a large number of studies on accounting text information emerged, and the volume of literature published abroad grew impressively. In China, as channels for acquiring research data have expanded, it has become difficult to find unique features through simple data research, and text-information research has gradually attracted the attention of scholars. A review of domestic and foreign literature finds that research on text information in China is still in its infancy, with many questions and angles yet to be explored, which makes it of great research value. Ren Yueqiao | Ren Mengying, "Summary of Research on Text Information and Enterprise Delisting", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-5, Issue-3, April 2021. URL: https://www.ijtsrd.com/papers/ijtsrd41120.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/41120/summary-of-research-on-text-information-and-enterprise-delisting/ren-yueqiao
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATION (IJNLC)
Text classification is a very important research area in machine learning, and artificial intelligence is reshaping text classification techniques to better acquire knowledge. In spite of the growth and spread of AI in text mining research for various languages such as English, Japanese, and Chinese, its role with respect to Myanmar text is not yet well understood. The aim of this paper is a comparative study of machine learning algorithms, namely Naïve Bayes (NB), k-nearest neighbours (KNN), and support vector machines (SVM), for Myanmar-language news classification; no such comparative study has previously been done for Myanmar news. Each news item is classified into one of four categories (Politics, Business, Entertainment, Sport). The dataset, collected from Myanmar-language news websites, consists of 12,000 documents belonging to the 4 categories, and the well-known algorithms above are applied to it. The goal of text classification is to classify documents into a certain number of predefined categories, and the news corpus is used for both training and testing the classifiers. For feature selection, the chi-square algorithm achieves comparable performance across a number of classifiers. The experimental results also show that the support vector machine achieves better accuracy than the other classification algorithms employed in this research. Because the Myanmar language is complex, it is all the more important to study and understand the nature of the data before proceeding to mining.
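The chi-square feature selection mentioned above scores each term against each category from a document-count contingency table. A generic sketch with made-up counts (not the paper's implementation) follows:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a term/category contingency table:
    n11 = docs in the category containing the term,
    n10 = docs outside the category containing the term,
    n01 = docs in the category without the term,
    n00 = docs outside the category without the term."""
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return num / den if den else 0.0

# A term independent of the category scores 0; a term perfectly
# correlated with the category scores n (the total document count).
print(chi_square(10, 10, 10, 10))  # 0.0
print(chi_square(20, 0, 0, 20))    # 40.0
```

Terms are then ranked by their chi-square score per category, and the top-ranked terms are kept as features for the classifiers.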
A novel approach for text extraction using an effective pattern matching technique (eSAT Journals)
Abstract
Many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns remains open for future research, especially in the field of text mining. Most existing text mining methods adopt term-based approaches, so they all suffer from the problems of polysemy (one word has multiple meanings) and synonymy (multiple words have the same meaning). The hypothesis that pattern-based approaches should perform better than term-based ones has long been held, but many experiments do not support it. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching, to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
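Since the keywords list TF-IDF, a minimal bag-of-words TF-IDF weighting, of the kind term-based methods rely on, can be sketched as follows (the toy documents are invented for illustration):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF weights for a list of tokenized documents.
    tf = term count / document length; idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = [["pattern", "mining", "text"],
        ["pattern", "matching"],
        ["text", "mining", "mining"]]
w = tf_idf(docs)
# "matching" occurs in only one document, so it carries the highest idf.
```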
Context based Document Indexing and Retrieval using Big Data Analytics - A Re... (rahulmonikasharma)
In the past few years, internet usage has grown worldwide, and the generation and use of data by users has increased rapidly; the data generated comes in different forms and may or may not be structured. As internet usage by individuals and organizations has grown, an increasing quantity and diversity of digital data, in the form of documents, has become available to end users. The storage, maintenance, and organization of such huge data in databases is a challenging task, so there is a great need for efficient and effective retrieval techniques that improve the accuracy of document retrieval. In this paper we discuss document retrieval using a context-based indexing approach. Lexical association between terms is used to separate content-carrying terms from other terms; content-carrying terms are used because they give an idea of the theme of the document. An indexing weight is calculated for each content-carrying term using a lexical association measure. A term with a higher indexing weight is considered important, and a sentence containing such terms is also important. When a user enters a search query, the important query terms are matched against the terms with higher weights in order to retrieve documents. Both explicit semantic relations and frequent co-occurrence of terms are considered in this context-based indexing.
Lexicon Based Emotion Analysis on Twitter Data (IJTSRD)
This paper presents a system that extracts information from automatically annotated tweets using well-known existing opinion lexicons and a supervised machine learning approach. The sentiment features are primarily extracted from novel high-coverage tweet-specific sentiment lexicons, automatically generated from tweets with sentiment-word hashtags and from tweets with emoticons. Sentence-level (tweet-level) classification is performed on these word-level sentiment features using a Sequential Minimal Optimization (SMO) classifier. The SemEval 2013 Twitter sentiment dataset is used in this work. Ablation experiments show that the system gains up to 6.8 absolute percentage points in F-score. Nang Noon Kham, "Lexicon Based Emotion Analysis on Twitter Data", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3, Issue-5, August 2019. URL: https://www.ijtsrd.com/papers/ijtsrd26566.pdf Paper URL: https://www.ijtsrd.com/computer-science/data-miining/26566/lexicon-based-emotion-analysis-on-twitter-data/nang-noon-kham
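A word-level lexicon score aggregated to tweet level, as described above, can be sketched as follows. The miniature lexicon here is invented for illustration; the paper's actual lexicons are generated automatically from hashtags and emoticons:

```python
# Hypothetical miniature sentiment lexicon mapping words to polarity scores.
LEXICON = {"love": 2.0, "great": 1.5, "ok": 0.3, "bad": -1.5, "hate": -2.0}

def tweet_score(tokens, lexicon=LEXICON):
    """Sum word-level lexicon scores into a tweet-level polarity score;
    words absent from the lexicon contribute nothing."""
    return sum(lexicon.get(t.lower(), 0.0) for t in tokens)

def classify(tokens):
    s = tweet_score(tokens)
    return "positive" if s > 0 else "negative" if s < 0 else "neutral"

print(classify("I love this great phone".split()))  # positive
print(classify("I hate waiting".split()))           # negative
```

In the paper these word-level scores serve as features for the SMO classifier rather than being summed directly, but the aggregation idea is the same.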
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS... (CSEIJ)
Query-sensitive summarization aims at providing users with a summary of the contents of one or more web pages based on the search query. This paper proposes a novel idea for generating a comparative summary from a set of URLs in a search result. The user selects a set of web page links from the results produced by a search engine, and a comparative summary of the selected web sites is generated. The method makes use of the HTML DOM tree structure of these web pages: HTML documents are segmented into sets of concept blocks, and the sentence score of each concept block is computed with respect to the query and feature keywords. The important sentences from the concept blocks of the different web pages are extracted to compose the comparative summary on the fly. This system reduces the time and effort required for the user to browse various web sites to compare their information, and the comparative summary of the contents helps users make decisions quickly.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents (IJSC)
The internet has caused humongous growth in the number of documents available online. Summaries of documents can help in finding the right information and are particularly effective when the document base is very large. Keywords are closely associated with a document, as they reflect the document's content and act as indices for it. In this work, we present a method to produce extractive summaries of documents in the Kannada language, with a given number of sentences as the limit. The algorithm extracts keywords from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents: we combine the scores obtained by the GSS (Galavotti, Sebastiani, Simi) coefficient and IDF (Inverse Document Frequency) methods, along with TF (Term Frequency), to extract keywords, and later use these for summarization based on sentence rank. In the current implementation, a document from a given category is selected from our database and, depending on the number of sentences given by the user, a summary is generated.
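The GSS coefficient named above can be estimated from a term/category contingency table of document counts. A minimal sketch with illustrative counts (the paper combines this score with TF and IDF, which is not shown here):

```python
def gss(n11, n10, n01, n00):
    """GSS (Galavotti-Sebastiani-Simi) coefficient for term t and
    category c: P(t,c)*P(~t,~c) - P(t,~c)*P(~t,c), with probabilities
    estimated from document counts:
    n11 = docs in c with t, n10 = docs outside c with t,
    n01 = docs in c without t, n00 = docs outside c without t."""
    n = n11 + n10 + n01 + n00
    return (n11 / n) * (n00 / n) - (n10 / n) * (n01 / n)

# A term independent of the category scores 0; a perfectly
# correlated term reaches the maximum of 0.25.
print(gss(10, 10, 10, 10))  # 0.0
print(gss(20, 0, 0, 20))    # 0.25
```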
Regression model focused on query for multi documents summarization based on ... (TELKOMNIKA Journal)
Document summarization is needed to obtain information effectively and efficiently. One way to obtain document summaries is to apply machine learning techniques. This paper proposes the application of regression models to query-focused multi-document summarization based on the significance of sentence position. The method used is Support Vector Regression (SVR), which estimates the weight of each sentence in the set of documents to be summarized, based on previously defined sentence features. A series of evaluations was performed on the DUC 2005 data set. The test results yield summaries with average precision and recall values of 0.0580 and 0.0590 measured with ROUGE-2, and 0.0997 and 0.1019 measured with ROUGE-SU4. The proposed regression model can measure the significance of the position of a sentence in a document well.
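The core of the ROUGE-N measures used above is clipped n-gram recall of a candidate summary against a reference summary. A minimal sketch (not the official ROUGE toolkit, which adds stemming, multiple references, and the skip-bigrams of ROUGE-SU4):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    """Clipped n-gram recall: matched n-grams (each counted at most as
    often as it appears in the reference) over reference n-grams."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0

# 2 of the reference's 5 bigrams appear in the candidate -> 0.4.
print(rouge_n("the cat sat".split(), "the cat sat on the mat".split()))  # 0.4
```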
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHM (IJNLC)
In recent years, we have witnessed the rapid development of deep neural networks and distributed representations in natural language processing. However, the application of neural networks to resume parsing lacks systematic investigation. In this study, we propose an end-to-end pipeline for resume parsing based on neural network classifiers and distributed embeddings. The pipeline leverages position-wise line information and the integrated meaning of each text block. Coordinated line classification by both a line type classifier and a line label classifier effectively segments a resume into predefined text blocks. Our proposed pipeline joins text block segmentation with the identification of resume facts, in which various sequence labelling classifiers perform named entity recognition within the labelled text blocks. A comparative evaluation of four sequence labelling classifiers confirmed the superiority of BLSTM-CNNs-CRF in the named entity recognition task. A further comparison with three publicly available resume parsers also demonstrated the effectiveness of our text block classification method.
A template based algorithm for automatic summarization and dialogue managemen... (eSAT Journals)
Abstract: This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of the research is to arrive at a methodology that helps in extracting important events such as dates, places, and subjects of interest. It is also convenient if the methodology presents the user with a shorter version of the text that contains all the non-trivial information. We also discuss the implementation of the algorithms we developed, which perform exactly this task. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
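The cosine similarity named in the keywords can be sketched under a simple bag-of-words model (illustrative only; the paper's actual feature representation is not shown):

```python
import math
from collections import Counter

def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two texts' bag-of-words count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Two of three distinct words are shared, so similarity is 2/3.
s = cosine_similarity("meeting on friday".split(),
                      "friday meeting agenda".split())
print(round(s, 3))  # 0.667
```

Sentences scoring above a similarity threshold against an extracted event template would then be kept in the shortened version of the text.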
Semantic Based Model for Text Document Clustering with Idioms (Waqas Tariq)
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. A method has been developed that performs the following steps: tag the documents for parsing, replace idioms with their literal meaning, calculate semantic weights for document words, and apply a semantic grammar. A similarity measure is computed between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters has been demonstrated.
This presentation covers the basic topics of HTML. It is helpful for improving one's knowledge of HTML, is suitable for beginners, and gives a basic grounding in HTML.
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON... (IJCSEIT)
This paper analyses the features of web search engines, vertical search engines, and meta search engines, and proposes a meta search engine for searching and retrieving documents across multiple domains on the World Wide Web (WWW). A web search engine searches for information on the WWW. A vertical search engine provides the user with results for queries on a particular domain. Meta search engines send the user's search queries to various search engines and combine the search results. This paper introduces intelligent user interfaces for selecting the domain, category, and search engines for the proposed multi-domain meta search engine. An intelligent user interface is also designed to take the user's query and send it to the appropriate search engines. Several algorithms are designed to combine the results from the various search engines and to display them.
Survey of Machine Learning Techniques in Textual Document Classification (IOSR Journals)
Text document classification aims at associating one or more predefined categories with a document, based on the likelihood expressed by a training set of labeled documents. Many machine learning algorithms play an important role in training such systems with predefined categories. Because of the importance of the machine learning approach, this study surveys text document classification based on the available statistical event models. The aim of this paper is to present the important techniques and methodologies employed for text document classification, while also drawing attention to some of the interesting challenges that remain to be solved, focusing mainly on text representation and machine learning techniques.
MACHINE LEARNING ALGORITHMS FOR MYANMAR NEWS CLASSIFICATIONijnlc
Â
Text classification is a very important research area in machine learning. Artificial Intelligence is reshaping text classification techniques to better acquire knowledge. In spite of the growth and spread of AI in text mining research for various languages such as English, Japanese, Chinese, etc., its role with respect to Myanmar text is not well understood yet. The aim of this paper is comparative study of machine learning algorithms such as NaĂŻve Bayes (NB), k-nearest neighbours (KNN), support vector machine (SVM) algorithms for Myanmar Language News classification. There is no comparative study of machine learning algorithms in Myanmar News. The news is classified into one of four categories (political, Business, Entertainment and Sport). Dataset is collected from 12,000 documents belongs to 4 categories. Well-known algorithms are applied on collected Myanmar language News dataset from websites. The goal of text classification is to classify documents into a certain number of pre-defined categories. News corpus is used for training and testing purpose of the classifier. Feature selection method, chi square algorithm achieves comparable performance across a number of classifiers. In this paper, the experimental results also show support vector machine is better accuracy to other classification algorithms employed in this research. Due to Myanmar Language is complex, it is more important to study and understand the nature of data before proceeding into mining.
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
Â
Abstract
There are many data mining techniques have been proposed for mining useful patterns from documents. Still, how to effectively use and update discovered patterns is open for future research , especially in the field of text mining. As most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy(words have multiple meanings) and synonymy(multiple words have same meaning). People have held hypothesis that pattern-based approaches should perform better than the term-based, but many experiments does not support this hypothesis. This paper presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern matching, to improve the effective use of discovered patterns.
Keywords: Pattern Mining, Pattern Taxonomy Model, Inner Pattern Evolving, TF-IDF, NLP etc.
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...rahulmonikasharma
Â
In past few years it is observed that the internet usage is been grown wider all over the world, hence, the data generation and usage is been increased rapidly by the users, the data generated in different forms may or may not be structured. The usage of internet by individuals and organizations have been grown so, there is increasing quantity and diversity of digital data in the form of documents, became available to the end users. The Storage, Maintenance and organization of such huge data in databases is a challenging task. So, there is a great need of efficient and effective retrieval technique which focuses on improving the accuracy of document retrieval. In this paper we are going to discuss about document retrieval using context based indexing approach. Here lexical association between terms is used to separate content carrying terms and other-terms. Content carrying terms are used as they give idea about theme of the document. Indexing weight calculation is done for content carrying terms. Lexical association measure is used to calculate indexing weight of terms. The term having higher indexing weight is considered as important and sentence which contains these terms is also important. When user enters search query, the important terms are matched with the terms with higher weights in order to retrieve documents. The explicit semantic relation or frequent co-occurrence of terms is been considered in this context based indexing.
Lexicon Based Emotion Analysis on Twitter Dataijtsrd
Â
This paper presents a system that extracts information from automatically annotated tweets using well known existing opinion lexicons and supervised machine learning approach. In this paper, the sentiment features are primarily extracted from novel high coverage tweet specific sentiment lexicons. These lexicons are automatically generated from tweets with sentiment word hashtags and from tweets with emoticons. The sentence level or tweet level classification is done based on these word level sentiment features by using Sequential Minimal Optimization SMO classifier. SemEval 2013 Twitter sentiment dataset is applied in this work. The ablation experiments show that this system gains in F Score of up to 6.8 absolute percentage points. Nang Noon Kham "Lexicon Based Emotion Analysis on Twitter Data" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-5 , August 2019, URL: https://www.ijtsrd.com/papers/ijtsrd26566.pdfPaper URL: https://www.ijtsrd.com/computer-science/data-miining/26566/lexicon-based-emotion-analysis-on-twitter-data/nang-noon-kham
QUERY SENSITIVE COMPARATIVE SUMMARIZATION OF SEARCH RESULTS USING CONCEPT BAS...cseij
Â
Query sensitive summarization aims at providing the users with the summary of the contents of single or multiple web pages based on the search query. This paper proposes a novel idea of generating a comparative summary from a set of URLs from the search result. User selects a set of web page links from the search result produced by search engine. Comparative summary of these selected web sites is generated. This method makes use of HTML DOM tree structure of these web pages. HTML documents are segmented into set of concept blocks. Sentence score of each concept block is computed with respect to the query and feature keywords. The important sentences from the concept blocks of different web pages are extracted to compose the comparative summary on the fly. This system reduces the time and effort required for the user to browse various web sites to compare the information. The comparative summary of the contents would help the users in quick decision making.
Keyword Extraction Based Summarization of Categorized Kannada Text Documents ijsc
Â
The internet has caused a humongous growth in the number of documents available online. Summaries of documents can help find the right information and are particularly effective when the document base is very large. Keywords are closely associated to a document as they reflect the document's content and act as indices for a given document. In this work, we present a method to produce extractive summaries of documents in the Kannada language, given number of sentences as limitation. The algorithm extracts key words from pre-categorized Kannada documents collected from online resources. We use two feature selection techniques for obtaining features from documents, then we combine scores obtained by GSS (Galavotti, Sebastiani, Simi) coefficients and IDF (Inverse Document Frequency) methods along with TF (Term Frequency) for extracting key words and later use these for summarization based on rank of the sentence. In the current implementation, a document from a given category is selected from our database and depending on the number of sentences given by the user, a summary is generated.
Regression model focused on query for multi documents summarization based on ...TELKOMNIKA JOURNAL
Â
Document summarization is needed to get the information effectively and efficiently. One method
used to obtain the document summarization by applying machine learning techniques. This paper
proposes the application of regression models to query-focused multi-document summarization based on
the significance of the sentence position. The method used is the Support Vector Regression (SVR) which
estimates the weight of the sentence on a set of documents to be made as a summary based on sentence
feature which has been defined previously. A series of evaluations performed on a data set of DUC 2005.
From the test results obtained summary which has an average precision and recall values of 0.0580 and
0.0590 for measurements using ROUGE-2, ROUGE 0.0997 and 0.1019 for measurements using
the proposed regression-SU4. Model can perform measurements of the significance of the position of
the sentence in the document well.
RESUME INFORMATION EXTRACTION WITH A NOVEL TEXT BLOCK SEGMENTATION ALGORITHMijnlc
Â
In recent years, we have witnessed the rapid development of deep neural networks and distributed representations in natural language processing. However, the applications of neural networks in resume parsing lack systematic investigation. In this study, we proposed an end-to-end pipeline for resume parsing based on neural networks-based classifiers and distributed embeddings. This pipeline leverages the position-wise line information and integrated meanings of each text block. The coordinated line classification by both line type classifier and line label classifier effectively segment a resume into predefined text blocks. Our proposed pipeline joints the text block segmentation with the identification of resume facts in which various sequence labelling classifiers perform named entity recognition within labelled text blocks. Comparative evaluation of four sequence labelling classifiers confirmed BLSTMCNNs-CRFâs superiority in named entity recognition task. Further comparison among three publicized resume parsers also determined the effectiveness of our text block classification method.
A template based algorithm for automatic summarization and dialogue managemen...eSAT Journals
Â
Abstract This paper describes an automated approach for extracting significant and useful events from unstructured text. The goal of research is to come out with a methodology which helps in extracting important events such as dates, places, and subjects of interest. It would be also convenient if the methodology helps in presenting the users with a shorter version of the text which contain all non-trivial information. We also discuss implementation of algorithms which exactly does this task, developed by us. Key Words: Cosine Similarity, Information, Natural Language, Summarization, Text Mining
Semantic Based Model for Text Document Clustering with IdiomsWaqas Tariq
Â
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data which is available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique to organize the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of the text documents. A method has been developed that performs the following: tag the documents for parsing, replacement of idioms with their original meaning, semantic weights calculation for document words and apply semantic grammar. The similarity measure is obtained between the documents and then the documents are clustered using Hierarchical clustering algorithm. The method adopted in this work is evaluated on different data sets with standard performance measures and the effectiveness of the method to develop in meaningful clusters has been proved.
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON... (ijcseit)
This paper analyses the features of Web Search Engines, Vertical Search Engines, and Meta Search Engines, and proposes a Meta Search Engine for searching and retrieving documents on multiple domains in the World Wide Web (WWW). A web search engine searches for information across the WWW, while a Vertical Search Engine provides the user with results for queries on a single domain. Meta Search Engines send the user's search queries to various search engines and combine the search results. This paper introduces intelligent user interfaces for selecting the domain, category, and search engines for the proposed Multi-Domain Meta Search Engine. An intelligent user interface is also designed to get the user query and send it to the appropriate search engines. A few algorithms are designed to combine results from the various search engines and to display the results.
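The result-combination step the abstract describes can be as simple as round-robin interleaving with de-duplication. The engine result lists below are hypothetical; real rank-aggregation schemes (score-based fusion, for instance) are more elaborate:

```python
from itertools import zip_longest

def merge_results(*ranked_lists):
    """Round-robin interleave several ranked result lists, dropping duplicates."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4"]
print(merge_results(engine_a, engine_b))  # ['u1', 'u2', 'u4', 'u3']
```

Round-robin preserves each engine's internal ordering while giving every engine a fair share of the top positions.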
Survey of Machine Learning Techniques in Textual Document Classification (IOSR Journals)
Classification of text documents means associating one or more predefined categories with a document based on the likelihood expressed by a training set of labeled documents. Many machine learning algorithms play an important role in training the system with predefined categories. Because of the perceived importance of the machine learning approach, this study examines text document classification based on the statistical event models available. The aim of this paper is to present the important techniques and methodologies employed for text document classification, while at the same time raising awareness of some of the interesting challenges that remain to be solved, focused mainly on text representation and machine learning techniques.
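As a concrete instance of the statistical event models such surveys refer to, here is a minimal multinomial Naive Bayes classifier with add-one (Laplace) smoothing, written from scratch; the toy training data is hypothetical:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)              # class frequencies
        self.word_counts = defaultdict(Counter)   # per-class term counts
        self.vocab = set()
        for text, y in zip(texts, labels):
            for w in text.lower().split():
                self.word_counts[y][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        def logp(y):
            total = sum(self.word_counts[y].values())
            lp = math.log(self.prior[y] / sum(self.prior.values()))
            for w in text.lower().split():
                lp += math.log((self.word_counts[y][w] + 1) /
                               (total + len(self.vocab)))
            return lp
        return max(self.classes, key=logp)

clf = NaiveBayes().fit(
    ["cheap pills buy now", "meeting agenda attached", "buy cheap now"],
    ["spam", "ham", "spam"])
print(clf.predict("cheap pills"))  # spam
```

Smoothing keeps unseen words from zeroing out a class's probability, which is why the +1 appears in both numerator and (via the vocabulary size) denominator.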
Survey on Key Phrase Extraction using Machine Learning Approaches (IJTSRD)
The automated keyword extraction task is to define a collection of representative terms for a text. Extracting keywords yields a small collection of terms, key phrases, and keywords that define the document's context, and keyword search allows large document collections to be searched effectively. To allocate suitable key phrases to new documents, text categorization techniques can be applied: a predefined collection of key phrases, from which all key phrases for new documents are selected, is given in the training documents. The training data for each key phrase is the collection of documents associated with it. Standard machine learning techniques are used to construct a classifier for each key phrase from the training materials, using the documents relevant to it as positive examples and the rest as negative examples. Given a new text, it is processed by the classifier of each key phrase. Preeti Sondhi | Aakib Jabbar, "Survey on Key Phrase Extraction using Machine Learning Approaches", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-5, Issue-3, April 2021. URL: https://www.ijtsrd.com/papers/ijtsrd39890.pdf Paper URL: https://www.ijtsrd.com/other-scientific-research-area/other/39890/survey-on-key-phrase-extraction-using-machine-learning-approaches/preeti-sondhi
Decision Support and Business Intelligence Systems (9th Edition) Instructor's Manual
Chapter 7:
Text Analytics, Text Mining, and Sentiment Analysis
Learning Objectives for Chapter 7
1. Describe text mining and understand the need for text mining
2. Differentiate among text analytics, text mining, and data mining
3. Understand the different application areas for text mining
4. Know the process of carrying out a text mining project
5. Appreciate the different methods to introduce structure to text-based data
6. Describe sentiment analysis
7. Develop familiarity with popular applications of sentiment analysis
8. Learn the common methods for sentiment analysis
9. Become familiar with speech analytics as it relates to sentiment analysis
10. Learn three facets of Web analytics: content, structure, and usage mining
11. Know social analytics including social media and social network analyses
CHAPTER OVERVIEW
This chapter provides a comprehensive overview of text analytics/mining and Web analytics/mining along with their popular application areas such as search engines, sentiment analysis, and social network/media analytics. As we have been witnessing in recent years, the unstructured data generated over the Internet of Things (IoT) (Web, sensor networks, radio-frequency identification [RFID]-enabled supply chain systems, surveillance networks, etc.) are increasing at an exponential pace, and there is no indication of its slowing down. This changing nature of data is forcing organizations to make text and Web analytics a critical part of their business intelligence/analytics infrastructure.
CHAPTER OUTLINE
7.1 Opening Vignette: Amadori Group Converts Consumer Sentiments into
Near-Real-Time Sales
7.2 Text Analytics and Text Mining Overview
7.3 Natural Language Processing (NLP)
7.4 Text Mining Applications
7.5 Text Mining Process
7.6 Sentiment Analysis
7.7 Web Mining Overview
7.8 Search Engines
7.9 Web Usage Mining
7.10 Social Analytics
ANSWERS TO END OF SECTION REVIEW QUESTIONS
Section 7.1 Review Questions
1. According to the vignette and based on your opinion, what are the challenges that the food industry is facing today?
Student perceptions may vary, but some common themes related to the challenges faced by the food industry could include the changing nature and role of food in people's lifestyles, the shift towards pre-prepared or easily prepared food, and the growing importance of marketing to keep customers interested in brands.
2. How can analytics help businesses in the food industry to survive and thrive in this competitive marketplace?
Analytics can serve dual purposes by both tracking customer interest in the brand as well as providing valuable feedback on customer preferences. An analytics system can be used to evaluate the traffic to various brand marketing campaigns (website or social) that play a pivotal role in ensuring that products are being shown to new potential customers.
Text mining is a technique that helps users find useful information from a large number of text documents on the web or in a database. Most popular text mining and classification methods have adopted term-based approaches; pattern-based methods describe user preferences. This review paper analyses how text mining works at three levels, i.e. the sentence level, the document level, and the feature level. We review the related work done previously and also discuss the problems that arise when text mining is performed at the feature level. Finally, this paper presents a technique for text mining on compound sentences.
Table of Content Detection using Machine Learning: Proposed System
International Journal of Artificial Intelligence & Applications (IJAIA), Vol. 4, No. 3, May 2013
DOI: 10.5121/ijaia.2013.4302
Table of Content Detection using Machine Learning: Proposed System
Rachana Parikh 1 and Prof. Avani Vasant 2
1 PG Student, C.E. Dept., V.V.P. Engineering College, Rajkot, G.T.U.
rachnaparikh88@gmail.com
2 H.O.D., I.T. Dept., V.V.P. Engineering College, Rajkot, G.T.U.
avanivasant@yahoo.com
ABSTRACT
Table of content (TOC) detection has drawn attention nowadays because it plays an important role in the digitization of multi-page documents. A book is generally a multi-page document, so it becomes necessary to detect the Table of Content page for easy navigation of the document and to make retrieval of the desired information from it faster. Table of content pages follow different layouts and different ways of presenting the contents of the document, such as chapters, sections, and subsections. This paper introduces a new method to detect the Table of Content using a machine learning technique with different features. The main aim of detecting Table of Content pages is to structure the document according to its contents.
KEYWORDS
Digital Document Library, Table of Content detection, XML conversion, feature extraction, Decision tree.
1. INTRODUCTION
As today's era turns from physical objects to digital, documents such as newspapers, magazines, and books are available digitally from digital libraries, and accessing them has become a common need. Books available digitally are formatted with a table of content for their content organization. In other words, a TOC is simply a collection of references to the individual articles of a document, no matter what layout it follows.
A table of content includes the titles or descriptions of sections and subsections, along with the order in which the content appears. It is a great way of organizing the content of a multi-page document and an easy way for users to navigate its pages.
For easy retrieval of the portions a user demands, the user needs to go through the table of content page, so that page must first be detected. Since table of content pages generally have different layout structures, their detection is a challenging task. TOC detection can be applied to magazines, digital libraries, and book documents with the different methods and algorithms available.
The commonly observed features for detecting TOC pages are as follows:
1. Page number-related heuristics and page images.
2. Chapter, section, and subsection structural elements, used to extract the indentation and font size of a multi-page document.
3. Predefined connector such as dot lines.
4. Headline and decorative element matching.
5. For Chinese books, geometrical rules and semantic rules are combined.
6. Labeling approach to delimit articles.
7. Probabilistic relaxation through training on the layout of several representative samples.
In this paper we propose a system to detect table of content pages using a machine learning approach. A decision tree algorithm is applied for table of content detection. The main aim is to achieve high detection accuracy.
2. LITERATURE REVIEW
Various techniques are available for table of content detection. A TOC can be detected with different approaches using different features, and existing approaches can be combined to discover new ones.
S. Mandal et al. [1] propose a method of TOC detection based on a morphology-based algorithm. Here the first step of TOC identification is finding the page numbers associated with the names of sections, sub-sections, or chapters. The method first finds word halos, then finds the right-most word, and then detects the TOC. It relies on finding the page number and other text-layout cues using a decision tree.
Adnan Ul-Hasan et al. [2] propose an OCR-free method of TOC detection. The first step is to binarize the document image. Digits and non-digits are then extracted using MLP classifiers, and the distribution of digits on a page is estimated by vertical projection. A typical TOC page exhibits a vertical peak in the projection profile; the width of the projection profile is then used to decide whether the input page is a TOC or not.
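The digit-distribution idea can be illustrated on plain text lines (Ul-Hasan et al. operate on binarized page images; this character-level analogue and the sample rows are assumptions for illustration). Digit occurrences are counted per character column, and a TOC page shows a tall peak near the right margin where the page numbers align:

```python
def digit_projection(lines, width=40):
    """Column-wise count of digit characters (a text-level analogue of the
    vertical projection profile computed on binarized page images)."""
    profile = [0] * width
    for line in lines:
        for x, ch in enumerate(line[:width]):
            if ch.isdigit():
                profile[x] += 1
    return profile

def toc_row(title, page, width=40):
    # Hypothetical TOC row: title, dot leaders, right-aligned page number.
    num = " " + str(page)
    return title + "." * (width - len(title) - len(num)) + num

page = [toc_row("1. Introduction ", 13),
        toc_row("2. Literature Review ", 14),
        toc_row("3. Applications ", 15)]
profile = digit_projection(page)
peak = max(profile)
rightmost_peak = max(x for x, c in enumerate(profile) if c == peak)
print(peak, rightmost_peak)
```

Every row contributes a digit to the same right-margin columns, so the peak there equals the number of TOC lines; on a body-text page the digits are scattered and no such peak forms.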
Shinji Tsuruoka et al. [3] propose a method that consists of pre-processing, structure understanding, and finally conversion to XML format. XML generation is necessary to understand the structure of document pages.
3. APPLICATIONS OF TOC DETECTION
Adnan Ul-Hasan et al. [2] show that the identified TOC is useful for improving the accuracy of other document understanding techniques, for example:
1. Advertisement Detection
The identified TOC can be fed to an advertisement finder component to identify advertisement pages. In this type of application TOC pages can be used to:
a) mark pages that appear in the TOC as article pages;
b) exploit the fact that pages before the TOC page are generally advertisement pages.
2. Title Detection
A TOC page provides a way to find article titles from its text. Generally the most significant title, i.e. the one the editor wants to use to attract the reader's attention, is placed on the TOC page.
3. For easy retrieval of information as demanded by the user.
When a user wants to retrieve specific information, it can be located from the Table of Content page, so retrieval of the demanded information becomes easier.
4. Easy and fast navigation to a specific page of the document.
When a user wants to retrieve specific information from a specific page number, navigation to that page becomes easy, because all the topics are organized on the Table of Content page with page numbers.
5. Table of content pages can also be detected in magazines and journals.
Table of content pages can also be detected in multi-page documents that follow a different content layout; e.g., magazines and journals have layouts different from book documents.
6. Avoiding Ctrl-F for finding a specific topic in a multi-page document.
Generally, in e-books the table of content is searched with Ctrl-F, so whenever content must be searched, the user has to rely on Ctrl-F, which is unavailable in documents that are mere scanned copies. The proposed system provides an easy way to search content automatically.
7. Detecting the ToC page for custom publication of a book.
Custom publication provides the facility to merge modules from different books, so for merging chapters we need to detect the table of content page.
4. PROBLEM STATEMENT
The main aim is to detect Tables of Contents automatically. For automatic detection we apply a machine learning technique: specifically, a decision tree algorithm will be applied to e-books or digital libraries. Finally, we calculate the overall accuracy using various attributes, such as presence of the title term, font type of the title term, font class of the title term, number of contextual terms, normalized frequency of ToC section terms, normalized frequency of lines starting with a number, normalized frequency of lines ending with a number, whether all the English numbers are in ascending order, and the normalized line position of the title term.
Many different approaches are available to identify the TOC pages of a multi-page document. The majority of them rely on format-specific features and identify TOC pages for a specific class of documents only, such as decorative elements (page numbering; chapter, section, and subsection numbering formats) or indentation detection. Some of the approaches are language specific. A variety of different layouts therefore exists, and since TOC pages are free from any specific format, it is a challenging task to detect TOC pages with different layout formats [6].
5. CHARACTERISTICS OF TOC PAGE
1. Generally, TOCs are available as a full page in books and reports. A TOC may also be part of a page; in some journals TOCs appear together with a logo, headings, and other text.
2. Generally, the rightmost columns contain digits, i.e. page numbers, but this format is not always maintained in every document. The page number may also appear just after the name of the section or subsection heading.
3. The leftmost columns usually hold the section and subsection numbers. The word gap between the title and the section number is much larger than normal word gaps.
4. Dotted lines or thin lines appear at regular intervals on TOC pages. These lines show the correlation between the end of a text line and the corresponding page number in the rightmost column.
5. A TOC page may have a multi-line title, and many TOCs come with horizontal separator lines.
6. TOCs have characteristics similar to a tabular structure.
7. Page numbers are short, mostly up to three digits for average books [5].
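Several of these characteristics (section numbers on the left, dot leaders, a short page number on the right) can be turned into a simple line-level heuristic. The regular expression and thresholds below are illustrative assumptions, not the paper's classifier:

```python
import re

# Heuristics from the listed characteristics: an optional section/subsection
# number on the left, dot leaders in the middle, and a short (up to 3-digit)
# page number at the right margin.
TOC_LINE = re.compile(r"""^\s*
    (?:\d+(?:\.\d+)*\.?\s+)?   # optional section number, e.g. '2.1'
    .+?                        # title text
    \.{3,}\s*                  # dot leaders
    \d{1,3}\s*$                # short page number
""", re.VERBOSE)

def looks_like_toc_line(line: str) -> bool:
    """True if the line matches the dot-leader + page-number pattern."""
    return bool(TOC_LINE.match(line))

print(looks_like_toc_line("2.1 Feature Extraction ........ 14"))  # True
print(looks_like_toc_line("This is ordinary body text."))         # False
```

A page whose fraction of such lines exceeds some threshold would then be a TOC candidate; the learned decision tree replaces this hand-tuned threshold.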
6. PROPOSED APPROACH
As shown in Fig. 1, various machine learning approaches are available for implementing the system. All the chosen approaches are described below.
A. Machine Learning
Machine learning is defined as the art of developing computer programs in such a way that they can teach themselves to grow and change when exposed to new data, without being explicitly programmed. The task here is to detect ToC pages automatically, and machine learning provides an efficient way to automate this.
Fig-1 Flow of Proposed approach
B. Classification
Classification is the task of assigning a class label to a set of unclassified cases on the basis of a training set. It is a formal method for repeatedly making judgments in new situations.
Here the task is to detect ToC pages in an e-book, so pages will be classified into ToC and Non-ToC pages.
C. Supervised Learning
Supervised learning means that the results, or the possible outcomes, are already known in advance from the training data. Here the system is first trained and then learns by itself, so during training the possible outcomes are well known in advance.
D. Decision Tree
A decision tree is a structure that can be used to divide a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules. Decision trees offer human interpretability, so the system is easy to train.
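A minimal sketch of how such a tree can be learned from labelled pages: for brevity this learns only a depth-1 tree (a decision stump) over boolean attributes, choosing the split with the lowest weighted Gini impurity. The feature names echo the paper's attributes, but the data values are hypothetical:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset."""
    if not labels:
        return 0.0
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def learn_stump(rows, labels):
    """Depth-1 decision tree: pick the boolean feature whose split
    minimises weighted Gini impurity (a minimal sketch of tree learning)."""
    best = None
    for f in rows[0]:
        yes = [y for r, y in zip(rows, labels) if r[f]]
        no = [y for r, y in zip(rows, labels) if not r[f]]
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if best is None or score < best[0]:
            maj = lambda ys: Counter(ys).most_common(1)[0][0] if ys else labels[0]
            best = (score, f, maj(yes), maj(no))
    _, f, if_true, if_false = best
    return lambda row: if_true if row[f] else if_false

# Hypothetical labelled pages with two of the paper's attributes.
pages = [
    {"has_toc_title": True,  "lines_end_number": True},
    {"has_toc_title": True,  "lines_end_number": True},
    {"has_toc_title": False, "lines_end_number": False},
    {"has_toc_title": False, "lines_end_number": True},
]
labels = ["ToC", "ToC", "NonToC", "NonToC"]
predict = learn_stump(pages, labels)
print(predict({"has_toc_title": True, "lines_end_number": False}))
```

A full decision-tree learner (e.g. C4.5 or CART) simply applies this split selection recursively to each branch until the leaves are pure.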
7. FLOW OF PROPOSED SYSTEM
1. Start with a digital library of offline PDFs or other documents.
2. Choose 15-30% of the pages of a single document.
3. Label them TOC and NON-TOC accordingly.
4. Train the system with various attributes of the page layout.
5. Feed some new documents to test the system.
6. Observe the learned data to assess system performance.
As shown in Fig. 2, the proposed system takes a digital library, offline PDFs, or any multi-page document as input. For table of content detection, the first 15-30% of pages are selected from the entire document and given to the system as input. All the selected pages are labeled ToC or NON-ToC accordingly; labelling is done manually so that the system is trained correctly. The system is then trained with different attributes of table of content detection, since table of content pages have several different attributes.
Fig-2 Flow of Proposed system
Table of Content pages follow different layouts for presenting their content, so the system is trained for detection with different attributes. In other words, training is done by labelling, so the system can easily learn from the labelled data. After learning, the system is tested for correct detection: the input and output data are observed, and the number of correctly detected Table of Content pages is matched against the actual ones to measure the performance of the system.
8. PROPOSED ALGORITHM
Step-1: Take a PDF or other multi-page document in which ToC pages are to be detected.
Step-2: Convert the PDF or multi-page document to XML.
Step-3: Read the XML file to extract features.
Step-4: Extract the following features to detect ToC pages:
I. Presence of the ToC title term.
II. Font type of the ToC title term.
III. Font class of the ToC title term.
IV. Number of contextual terms.
V. Number of section terms.
VI. Line position of the title term.
VII. Lines starting with a number.
VIII. Lines ending with a number.
IX. Are the English numbers in ascending order?
Step-5: Construct an output file with the features extracted from the document for decision-tree
generation.
Step-6: Label the training data with a class label (ToC/Non-ToC).
Step-7: Learn a decision tree from the training data.
Step-8: For unseen data,
(a) follow Steps 1 to 5;
(b) apply the decision tree learned in Step 7 to find the class labels.
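Steps 4-8 can be sketched with scikit-learn's decision-tree implementation. The paper does not name a specific toolkit, and the feature values below are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier

# Each row is one page, with a subset of the paper's features:
# [has ToC title (0/1), title in bold (0/1), contextual-term count,
#  fraction of lines starting with a number, fraction of lines ending with a number]
# The feature values below are invented for illustration.
X_train = [
    [1, 1, 0, 0.60, 0.90],  # a real ToC page
    [1, 0, 1, 0.40, 0.70],  # ToC page with an unbold title
    [0, 0, 3, 0.05, 0.10],  # an ordinary body page
    [0, 0, 2, 0.00, 0.00],  # another body page
]
y_train = ["ToC", "ToC", "NonToC", "NonToC"]  # Step 6: class labels

clf = DecisionTreeClassifier(random_state=0)  # Step 7: learn the tree
clf.fit(X_train, y_train)

# Step 8: classify an unseen page after extracting the same features.
unseen = [[1, 1, 0, 0.55, 0.85]]
print(clf.predict(unseen))
```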
A set of PDF or multi-page documents containing a table of contents page is selected for ToC page
detection. Detecting the ToC page requires extracting features from the document, and feature
extraction requires the document to be in a structured format, so each selected document is
converted into XML. XML provides a structured format with different layout-related attributes,
and the converted XML is then read to access the data contained in the document.
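As a sketch, the converted XML can be read with Python's standard library. The element and attribute names below follow the style of `pdftohtml -xml` output and are an assumption, since the paper does not specify which converter is used:

```python
import xml.etree.ElementTree as ET

# Assumed schema in the style of `pdftohtml -xml` output; the actual
# converter used by the paper is not specified.
xml_doc = """<pdf2xml>
  <page number="3">
    <text font="1"><b>Table of Contents</b></text>
    <text font="0">1. Introduction ........ 1</text>
    <text font="0">2. Related Work ....... 5</text>
  </page>
</pdf2xml>"""

root = ET.fromstring(xml_doc)
for page in root.iter("page"):
    # itertext() flattens nested markup such as <b>...</b> into plain text.
    lines = ["".join(t.itertext()) for t in page.iter("text")]
    print(page.get("number"), lines)
```

Each `text` element also carries a `font` attribute, which is where font-type and font-class features would be read from.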
Many features are available for detecting a table of contents page in a multi-page document:
layout-based, number-based and term-based attributes. The proposed system uses a mixture of
term-based and number-based attributes. The attributes selected for table of contents detection
are as follows.
1. Presence of the title term: check for the title term, i.e. the words "Table of Contents" or
"Contents". Most table of contents pages carry one of these terms.
2. Font type of the title term: if the title term exists, its font type differs from the other
words. Most title terms are in bold, so they are easily distinguishable.
3. Font class of the title term: like the font type, the title term has a distinct font class.
It is commonly Times New Roman, but may differ from document to document.
4. Contextual terms: to confirm an actual title term, examine its neighbours, i.e. the
contextual terms. If the candidate has neither a successor nor a predecessor, it is the title term.
5. Normalized frequency of section terms: the higher the frequency of section terms on a page,
the higher the probability that the page is a table of contents page.
6. Line position of the title term: if the title term exists, check the line number on which it
occurs. If the title term occurs in a starting line, the page is a candidate ToC page; otherwise
it is not.
7. Normalized frequency of lines starting with a number: subsections are indicated by numbers
such as 1.1, 1.2, 1.3, so a line beginning with such a number suggests a ToC entry.
8. Normalized frequency of lines ending with a number: in a table of contents, topics and
subtopics list their page numbers, e.g. 1, 2, 3, 4, 5, 6.
9. Are the English numbers in ascending order? In a table of contents, the page numbers and
section numbers always appear in ascending order.
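Features 7-9 can be sketched as simple functions over a page's text lines. The regular expressions below are an assumed implementation, not the paper's:

```python
import re

def frac_lines_starting_with_number(lines):
    """Feature 7: fraction of lines beginning with a section number like 1.1."""
    hits = sum(bool(re.match(r"\s*\d+(\.\d+)*", ln)) for ln in lines)
    return hits / len(lines) if lines else 0.0

def frac_lines_ending_with_number(lines):
    """Feature 8: fraction of lines ending with a page number."""
    hits = sum(bool(re.search(r"\d+\s*$", ln)) for ln in lines)
    return hits / len(lines) if lines else 0.0

def numbers_ascending(lines):
    """Feature 9: are the trailing page numbers in ascending order?"""
    nums = [int(m.group(1)) for ln in lines
            if (m := re.search(r"(\d+)\s*$", ln))]
    return all(a <= b for a, b in zip(nums, nums[1:]))

toc_lines = ["1. Introduction 1", "1.1 Motivation 2", "2. Method 7"]
print(frac_lines_starting_with_number(toc_lines))  # 1.0
print(frac_lines_ending_with_number(toc_lines))    # 1.0
print(numbers_ascending(toc_lines))                # True
```

On a ToC page like `toc_lines` all three features score highly, while ordinary prose pages score near zero, which is what makes them discriminative.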
After extracting the features, an output file containing all the attribute values is generated;
the decision tree is built from this file.
The training data are then labelled with the class ToC or Non-ToC, and the system learns from
these class labels. For unseen data, the same procedure is followed.
The proposed algorithm can handle most page layouts and provides high accuracy in detecting
table of contents pages. All the proposed features are unbiased towards any particular term or
number. Various tools are available for implementing the algorithm.
9. TRAINING DATASET
Various attributes are used for detecting a table of contents page; the following attributes are
used to train the system. The training set shown here is a sample for the proposed system. A
decision tree is generated from the training set, and from the tree we can easily see what the
system has learned.
The values of the different features vary across the training dataset. Three of the features use
the normalized frequency of an attribute's presence to detect the table of contents page; the
remaining attributes take different values accordingly. The decision tree is generated from
these training-set attribute values.
10. DECISION TREE GENERATION
The decision tree is generated automatically from the dataset, so it may vary with the dataset
values. The system learns from the training dataset and generates the tree; after training, it
detects table of contents pages automatically. The decision tree for the dataset given above is
shown below.
Fig-3: Decision tree for the dataset
As shown in the decision tree, the attributes are split according to their normalized
frequencies, and the tree is regenerated each time from the attribute values. Because a decision
tree is human-interpretable, we can easily understand what the machine is learning from the
different inputs.
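This interpretability can be seen by printing the learned rules, for example with scikit-learn's `export_text` (the paper does not name a toolkit, and the toy data below are invented):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data with two of the paper's features; the values are invented.
X = [[1, 0.9], [1, 0.7], [0, 0.1], [0, 0.0]]  # [has ToC title, frac. lines ending in number]
y = ["ToC", "ToC", "NonToC", "NonToC"]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text renders the learned tree as nested if/else conditions,
# readable by a human without any plotting tools.
rules = export_text(clf, feature_names=["has_toc_title", "frac_lines_end_number"])
print(rules)
```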
11. CONCLUSION
In this paper we have proposed an approach for ToC detection using a decision tree algorithm, a
machine learning technique. Various attributes are derived for detecting a table of contents:
presence of the title term, font type of the title term, font class of the title term, number of
contextual terms, normalized frequency of ToC section terms, normalized frequency of lines
starting with a number, normalized frequency of lines ending with a number, whether all the
English numbers are in ascending order, and the normalized line position of the title term. All
these attributes have been specified and explained, and will be applied to table of contents
detection. Future work is to implement the proposed system and measure the accuracy of table of
contents detection.
Biography
Rachana P. Parikh is a PG scholar at Vyavasayi Vidya Pratishthan Engineering
College, Gujarat Technological University, Gujarat, India. Her area of interest is Artificial
Intelligence. rachnaparikh88@gmail.com
Prof. Avani Vasant is an assistant professor and Head of the Information Technology
department at Vyavasayi Vidya Pratishthan Engineering College, Rajkot, Gujarat, India. She
has more than 13 years of teaching experience. Her area of interest is Artificial
Intelligence. avanivasant@yahoo.com