Text categorization has interested researchers for quite some time. It is the task of assigning news articles to specific groups so that the effort of categorizing them manually can be reduced. A growing number of statistical classification and machine learning techniques have been applied to text categorization. This paper addresses the automatic categorization of news articles through clustering with the k-means algorithm. The goal is to group news articles into categories automatically. Our work concentrates on k-means for clustering and uses a TF-IDF dictionary for term weighting during categorization. The system is implemented on the Mahout platform.
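As an illustration of the kind of pipeline described above (not the authors' Mahout implementation), the following minimal sketch clusters a handful of news snippets with TF-IDF weighting and k-means; the sample texts, the scikit-learn library, and the choice of two clusters are assumptions made for the example.

```python
# Minimal sketch of TF-IDF + k-means news clustering (illustrative only;
# the paper itself uses Apache Mahout rather than scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical news snippets used only for demonstration.
articles = [
    "The central bank raised interest rates amid inflation concerns.",
    "The striker scored twice as the home team won the league match.",
    "Stock markets rallied after the quarterly earnings reports.",
    "The goalkeeper's late save kept the cup final level.",
]

# Build TF-IDF vectors (the "TF-IDF dictionary" step).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(articles)

# Cluster the articles; k=2 (finance vs. sport) is an assumption here.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for article, label in zip(articles, labels):
    print(label, article[:50])
```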
Mining academic social networks is becoming increasingly necessary as the amount of data grows, and it is a favorite research topic for many researchers. Data mining techniques are used to mine academic social networks. In this paper, we present an efficient frequent itemset mining technique for academic social networks. The proposed framework first processes the research documents, and then enhanced frequent itemset mining is applied to find the strength of the relationships between researchers. The proposed method is expected to be faster than older algorithms and to require less main memory for computation.
This paper focuses on extracting important words from conversations and then using those words to retrieve a small number of relevant documents that can be given to users according to their needs. However, even a short audio fragment contains a large number of words that may relate to several topics, and the output of an automatic speech recognition (ASR) system is subject to recognition errors, so it is difficult to serve the information needs of conversation participants appropriately. An algorithm is proposed to extract keywords from the output of an ASR system (or from a manual transcript, for testing) in a way that supports the potentially diverse topics and reduces ASR noise. A technique is then used to build several implicit queries from the selected keywords, and these queries in turn produce lists of relevant documents. The scores show that our proposal improves over previous systems that consider only word co-occurrence or topic similarity, and that it represents a suitable solution for a document recommender system to be used during conversations.
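The abstract does not give the algorithm's details, but the general idea (keyword selection from a noisy transcript followed by several implicit queries) can be sketched as follows; the frequency-based scoring over a stop-word-filtered transcript and the grouping of keywords into fixed-size queries are assumptions made only for illustration, not the paper's method.

```python
# Rough sketch: pick salient words from a (possibly noisy) transcript and
# form several small implicit queries from them. Scoring and grouping
# choices here are illustrative assumptions, not the paper's algorithm.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "we", "about"}

def extract_keywords(transcript: str, top_k: int = 9) -> list[str]:
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(top_k)]

def implicit_queries(keywords: list[str], per_query: int = 3) -> list[list[str]]:
    # Split the keyword list into several small queries so that
    # different topics in the conversation are each covered.
    return [keywords[i:i + per_query] for i in range(0, len(keywords), per_query)]

transcript = "we talked about solar panels and battery storage and grid connection costs"
kws = extract_keywords(transcript)
print(implicit_queries(kws))
```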
A Document Exploring System on LDA Topic Model for Wikipedia Articles (ijma)
A large amount of digital text is generated every day, and effectively searching, managing and exploring this text data has become an important task. In this paper, we first present an introduction to text mining and the LDA topic model. We then explain in detail how to apply the LDA topic model to a text corpus through experiments on Simple Wikipedia documents. The experiments include all the necessary steps: data retrieval, pre-processing, fitting the model, and an application in the form of a document exploring system. The results show that the LDA topic model works effectively for clustering documents and finding similar documents. Furthermore, the document exploring system could be a useful research tool for students and researchers.
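A minimal example of fitting an LDA topic model, in the spirit of the experiments described above; the toy documents stand in for the Simple Wikipedia corpus, and the use of gensim is an assumed library choice rather than the paper's setup.

```python
# Minimal LDA fitting sketch with gensim (toy documents stand in for the
# Simple Wikipedia corpus used in the paper's experiments).
from gensim import corpora
from gensim.models import LdaModel

docs = [
    "cats and dogs are popular pets",
    "dogs need daily walks and training",
    "the stock market fell on inflation news",
    "investors watch interest rates and markets",
]
tokenized = [d.lower().split() for d in docs]

dictionary = corpora.Dictionary(tokenized)            # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in tokenized]   # bag-of-words vectors

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=0)

for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```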
World ranking universities final documentation (Bhadra Gowdra)
With the coming deluge of semantic data, the fast growth of ontology bases has brought significant challenges for efficient and scalable reasoning. Traditional centralized reasoning methods are not sufficient to process large ontologies, so distributed methods are required to improve the scalability and performance of inference. This paper proposes an incremental and distributed inference method (IDIM) for large-scale ontologies and RDF datasets using MapReduce, which realizes high-performance reasoning and runtime querying, especially for incremental knowledge bases. MapReduce is chosen because it can limit data exchange and alleviate load-balancing problems by dynamically scheduling jobs on computing nodes. To store incremental RDF triples more efficiently, two novel concepts are introduced: the transfer inference forest (TIF) and effective assertion triples (EAT). Their use largely reduces storage and simplifies and accelerates the reasoning and search process. Based on TIF/EAT, the RDF closure need not be computed and stored, and the reasoning time decreases so significantly that a user's online query can be answered in a timely manner, which is more efficient than existing methods to the best of our knowledge. More importantly, updating TIF/EAT requires only minimal computation because the relationship between new triples and existing ones is fully exploited, which is not found in the existing literature.
This document discusses building an inverted index to efficiently support information retrieval on large document collections. It describes tokenizing documents, building a dictionary of normalized terms, and creating postings lists that map each term to the documents it appears in. Inverted indexes allow skipping linear scanning and support flexible queries by indexing term locations. The document also covers calculating precision and recall to measure system effectiveness.
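The steps just described (tokenize, build a dictionary, record postings, then evaluate with precision and recall) can be sketched in a few lines; this is a generic illustration, not the document's own code.

```python
# Sketch of a tiny in-memory inverted index plus precision/recall,
# illustrating the steps described above.
from collections import defaultdict

docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

# Build the inverted index: term -> sorted list of doc ids (postings list).
index = defaultdict(list)
for doc_id, text in docs.items():
    for term in sorted(set(text.lower().split())):
        index[term].append(doc_id)

def search_and(*terms):
    """Conjunctive query: intersect the postings lists of all terms."""
    postings = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

retrieved = set(search_and("home", "sales", "july"))
relevant = {2, 3}  # assumed relevance judgments for this example query
precision = len(retrieved & relevant) / len(retrieved)
recall = len(retrieved & relevant) / len(relevant)
print(retrieved, precision, recall)
```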
An Evaluation and Overview of Indices Based on Arabic Documents (IJCSEA Journal)
The paper gives an overview of inverted files, signature files, suffix arrays and suffix trees over a collection of Arabic documents. It also presents the points of comparison between these techniques and the performance of each technique on every comparison point. An information retrieval system is usually evaluated through its efficiency and its effectiveness, and there are two aspects of efficiency: time and space. The time measure represents the time needed to retrieve a document relevant to a specified query, while space represents the amount of memory needed to create the indices.
In this paper, four indices are built: inverted file, signature file, suffix array and suffix tree. To measure the performance of each one, a retrieval system is built to compare the results of using these indices.
A collection of 242 Arabic abstracts from the proceedings of the Saudi Arabian National Computer Conferences has been used in these systems, and a collection of 60 Arabic queries has been run on them. We found that the retrieval results for inverted files are better than those for the other indices.
Topic Mining on disaster data (Robert Monné)
This document describes applying topic modeling to analyze unstructured disaster response documents. It outlines preprocessing PDF documents into text format by converting to a corpus and then splitting into segments based on table of contents headings. Topic modeling is then used to extract topics from the text segments to help identify relevant information for disaster responders. The goal is to create a replicable text mining method that can quickly analyze documents and extract critical information for decision making during disaster response.
Keyphrase Extraction using Neighborhood Knowledge (IJMTST Journal)
This paper focuses on keyphrase extraction for news articles, because the news article is one of the most popular document genres on the web and most news articles have no author-assigned keyphrases. Existing methods for single-document keyphrase extraction usually use only the information contained in the specified document. This paper proposes using a small number of nearest-neighbor documents to provide additional knowledge that improves single-document keyphrase extraction. Experimental results demonstrate the effectiveness and robustness of the proposed approach. Based on experiments conducted on several documents, we consider the top 10 extracted keyphrases to be the most suitable.
This document provides an overview of information retrieval models, including vector space models, TF-IDF, Doc2Vec, and latent semantic analysis. It begins with basic concepts in information retrieval like document indexing and relevance scoring. Then it discusses vector space models and how documents and queries are represented as vectors. TF-IDF weighting is explained as assigning higher weight to rare terms. Doc2Vec is introduced as an extension of word2vec to learn document embeddings. Latent semantic analysis uses singular value decomposition to project documents to a latent semantic space. Implementation details and examples are provided for several models.
10 Popular Hadoop Technical Interview Questions (ZaranTech LLC)
The document discusses 10 common questions that may be asked in a Hadoop technical interview. It provides definitions for big data and the four V's of big data (volume, variety, veracity, velocity). It also discusses how businesses use big data analytics to increase revenue, examples of companies that use Hadoop, the difference between structured and unstructured data, the concepts that Hadoop works on (HDFS and MapReduce), core Hadoop components, hardware requirements for running Hadoop, common input formats, and some common Hadoop tools. Overall, the document outlines essential information about big data and Hadoop that may be helpful to review for a technical interview.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of commodity hardware. It uses HDFS for fault-tolerant storage and MapReduce as a programming model for distributed computing. HDFS stores data across clusters of machines and replicates it for reliability. MapReduce allows processing of large datasets in parallel by splitting work into independent tasks. Hadoop provides reliable and scalable storage and analysis of very large amounts of data.
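As a concrete illustration of the MapReduce model mentioned above, the classic word count can be written as a small pair of map and reduce steps for Hadoop Streaming; this is only a sketch, and the exact job submission flags and paths depend on the actual cluster setup.

```python
# wordcount_streaming.py - sketch of a Hadoop Streaming word count.
# Typically run with the streaming jar, passing this script as both
# "--mapper 'python wordcount_streaming.py map'" and
# "--reducer 'python wordcount_streaming.py reduce'" (cluster-specific).
import sys

def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")            # emit (word, 1) pairs

def reducer():
    current, total = None, 0
    for line in sys.stdin:                 # input arrives sorted by key
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    (mapper if sys.argv[1] == "map" else reducer)()
```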
A Robust Keywords Based Document Retrieval by Utilizing Advanced Encryption S... (IRJET Journal)
This document presents a methodology for a robust keyword-based document retrieval system utilizing advanced encryption. Key aspects of the methodology include:
1) Performing stop word removal and term frequency analysis to create feature vectors for documents.
2) Assigning unique numbers to terms to create a dictionary and document codes for identification and comparison.
3) Using advanced encryption techniques like substitution and mixing on the document codes.
4) Comparing the codes of user queries to document codes to find and rank the most relevant documents.
The methodology is tested on real and artificial datasets, showing improved accuracy, precision, and recall over previous methods according to experimental results.
DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC... (cscpconf)
Online text documents rapidly increase in number with the growth of the World Wide Web. To manage such a huge amount of text, several text mining applications have come into existence. Applications such as search engines, text categorization, summarization and topic detection are based on feature extraction. Extracting keywords or features manually is an extremely time-consuming and difficult task, so an automated process that extracts keywords or features needs to be established. This paper proposes a new domain keyword extraction technique that includes a new weighting method built on conventional TF-IDF. Term frequency-inverse document frequency is widely used to express document feature weights, but it cannot reflect the distribution of terms across documents, and therefore cannot reflect the significance of terms or the differences between categories. This paper proposes a new weighting method in which a new weight is added to the original TF-IDF to express the differences between domains. The extracted features represent the content of the text better and have better discriminating power.
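For reference, the weighting scheme that the proposed method extends is the standard TF-IDF score shown below; how exactly the extra domain weight is combined is not stated in the summary above, so the multiplicative factor $w_{\mathrm{dom}}(t)$ is only a plausible reading, not the paper's definition.

```latex
% Standard TF-IDF weight of term t in document d over corpus D,
% with a hypothetical domain factor w_dom(t) shown multiplicatively.
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}
\qquad
w(t, d) = w_{\mathrm{dom}}(t) \cdot \mathrm{tfidf}(t, d, D)
```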
This document proposes a BOT virtual guide that will extract educational web content based on topics recently taught using web crawling techniques. It will use a domain ontology, DOM parsing, and concept-focused crawling to find relevant documents from the web. The documents will be ranked based on their concept similarity to the topic. The filtered and crawled data will then be provided to students as speech output through a text-to-speech system to serve as an automated virtual guide for supplemental learning materials.
This document provides an overview of text mining and web mining. It defines data mining and describes the common data mining tasks of classification, clustering, association rule mining and sequential pattern mining. It then discusses text mining, defining it as the process of analyzing unstructured text data to extract meaningful information and structure. The document outlines the seven practice areas of text mining as search/information retrieval, document clustering, document classification, web mining, information extraction, natural language processing, and concept extraction. It provides brief descriptions of the problems addressed within each practice area.
This document discusses several key differences between traditional databases and Hive. Hive uses a schema-on-read model where the schema is not enforced during data loading, making the initial load much faster. However, this impacts query performance since indexing and compression cannot be applied during loading. Pig Latin is a data flow language where each step transforms the input relation, unlike SQL which is declarative. While Hive originally lacked features like updates, transactions and indexing, the developers are working to integrate HBase and improve support for these features.
The document discusses processing Boolean queries in an information retrieval system using an inverted index. It describes the steps to process a simple conjunctive query by locating terms in the dictionary, retrieving their postings lists, and intersecting the lists. More complex queries involving OR and NOT operators are also processed in a similar way. The document also discusses optimizing query processing by considering the order of accessing postings lists.
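The postings-list intersection at the heart of conjunctive query processing is a simple two-pointer merge over sorted doc-id lists; the following is a generic sketch rather than code taken from the document.

```python
# Two-pointer intersection of two sorted postings lists, as used when
# processing a conjunctive (AND) Boolean query.
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# Processing "brutus AND caesar": starting with the shorter postings list
# keeps intermediate results small (the ordering optimization noted above).
print(intersect([1, 2, 4, 11, 31], [2, 31, 54, 101]))  # -> [2, 31]
```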
The document summarizes the vector space model for scoring and ranking documents in response to a query in an information retrieval system. It explains that in this model, documents and queries are represented as vectors in a common vector space. The similarity between a document and query vector is measured by calculating the cosine similarity of the two vectors, which scores and ranks documents based on the terms they share with the query. It also describes how the vector space model allows retrieving the top K documents by relevance rather than using a Boolean retrieval model.
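A minimal sketch of the cosine scoring described above, ranking toy documents against a query in a shared TF-IDF vector space; the library choice and sample texts are assumptions for illustration.

```python
# Cosine-similarity ranking sketch in a shared TF-IDF vector space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "information retrieval ranks documents by relevance",
    "boolean retrieval returns unranked document sets",
    "cosine similarity compares vectors in a common space",
]
query = "ranking documents by cosine similarity"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
top_k = scores.argsort()[::-1]          # best-scoring documents first
for rank, idx in enumerate(top_k, start=1):
    print(rank, round(float(scores[idx]), 3), docs[idx])
```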
Conceptual foundations of text mining and preprocessing steps (El Habib NFAOUI)
This document provides an overview of conceptual foundations and preprocessing steps for text mining. It discusses the differences between syntax and semantics in text, and presents a general framework for text analytics including preprocessing, representation, and knowledge discovery. For text representation, it describes bag-of-words models and vector space models, including frequency vectors, one-hot encoding, and TF-IDF weighting. It also provides an introduction to n-grams for representing sequential data.
The Study of the Large Scale Twitter on Machine Learning (IRJET Journal)
This document discusses a study on integrating machine learning tools into a Hadoop-based analytics platform at Twitter. Specifically, it focuses on extending the Pig dataflow language to provide predictive analytics capabilities using machine learning algorithms like logistic regression and decision trees. Key points discussed include developing storage functions in Pig for both batch and online learning algorithms, training models by streaming training data through the functions, and scaling out training via model ensembles. An example application of sentiment analysis on tweets using these new Pig extensions is also described to illustrate the end-to-end workflow.
MMP-TREE FOR SEQUENTIAL PATTERN MINING WITH MULTIPLE MINIMUM SUPPORTS IN PROG... (IJCSEA Journal)
The document proposes a new algorithm called MS-PISA for mining sequential patterns from progressive databases that have multiple minimum support thresholds. MS-PISA uses a tree structure called MMP-tree to store information about the database and discovered patterns. The MMP-tree tracks the percentage of participation of each itemset based on the minimum support thresholds. MS-PISA progressively updates the MMP-tree as new data arrives to find sequential patterns that satisfy the varying minimum support requirements.
This document proposes a range query algorithm for big data based on MapReduce and P-Ring indexing. It introduces a dual-index structure that uses an improved P-Ring to index files across nodes, and B+ trees to index attribute values within files. The algorithm divides range queries into two steps: using P-Ring to search for relevant files, then searching within those files' B+ trees to find matching data. It describes how the P-Ring and B+ trees are constructed and used to process range queries in a distributed manner optimized for big data environments.
This document discusses text mining and information extraction. It covers the goals of information extraction including extracting structured data from unstructured text. It also discusses named entity recognition, challenges in NER, maximum entropy methods for NER, template filling using statistical and finite-state approaches, and applications of information extraction.
The document discusses various techniques for dimensionality reduction and analysis of text data, including latent semantic indexing (LSI), locality preserving indexing (LPI), and probabilistic latent semantic analysis (PLSA). LSI uses singular value decomposition to project documents into a lower-dimensional space while minimizing reconstruction error. LPI aims to preserve local neighborhood structures between similar documents. PLSA models documents as mixtures of underlying latent themes characterized by multinomial word distributions.
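As a sketch of the LSI step described above (an SVD-based projection of a TF-IDF term-document matrix into a low-dimensional latent space), assuming scikit-learn as the library and a few toy texts as data:

```python
# LSI sketch: TF-IDF term-document matrix projected to a 2-dimensional
# latent semantic space with truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "cats and kittens like milk",
    "interest rates and inflation move markets",
    "stock markets react to central bank rates",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)
svd = TruncatedSVD(n_components=2, random_state=0)
doc_embeddings = svd.fit_transform(X)    # documents in the latent space

print(doc_embeddings.shape)              # (4, 2)
print(svd.explained_variance_ratio_)     # structure captured per dimension
```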
Text mining refers to extracting knowledge from unstructured text data. It is needed because most biological knowledge exists in unstructured research papers, making it difficult for scientists to manually analyze large amounts of papers. Text mining can help address this by automatically analyzing papers and identifying relevant information. However, text mining also faces challenges like dealing with unstructured text, word ambiguities, and noisy data. The basic steps of text mining involve preprocessing text through tasks like tokenization, feature selection to identify important terms, and parsing to separate words and punctuation.
This document discusses big data and Hadoop. It defines big data as large datasets that are difficult to process using traditional methods due to their volume, variety, and velocity. Hadoop is presented as an open-source software framework for distributed storage and processing of large datasets across clusters of commodity servers. The key components of Hadoop are the Hadoop Distributed File System (HDFS) for storage and MapReduce as a programming model for distributed processing. A number of other technologies in Hadoop's ecosystem are also described such as HBase, Avro, Pig, Hive, Sqoop, Zookeeper and Mahout. The document concludes that Hadoop provides solutions for efficiently processing and analyzing big data.
Twitter word frequency count using Hadoop components (Pradip Patel)
Abstract - Analysis of Twitter data can be very useful for marketing as well as for promotion strategies. This paper implements a word frequency count for a dataset of more than 1.5 million tweets. Hadoop components such as Apache Pig and the MapReduce framework are used for parallel execution over the data. This parallel execution makes the implementation time-effective and feasible to execute on a single system.
This document contains the following:
1. An algorithm called the Weight Short Algorithm is proposed to determine the next neighboring element in a matrix of data with the lowest traversal value without transmitting hop count or neighbor information.
2. The algorithm traverses the matrix in an odd symmetrical pattern and checks for the next neighbor in the same row, same column, or diagonally. No repetitions of node pairs are allowed.
3. A data structure is presented to represent the nodes in the matrix growing in odd symmetry. The algorithm is then described to search for the next neighbor position by checking rows, columns, and diagonals based on this data structure.
IRJET - Determining Document Relevance using Keyword Extraction (IRJET Journal)
This document describes a system that aims to search for and retrieve relevant documents from a large collection based on a user's query. It does this through three main components: keyword extraction, document searching, and a question answering bot. Keyword extraction is done using the TF-IDF algorithm to identify important words in documents. These keywords are stored in a database along with their TF-IDF weights. When a user submits a query, the system searches for documents containing keywords from the query and returns relevant results. It also includes a feedback mechanism for users to improve search accuracy over time. The goal is to deliver accurate search results quickly from large document collections.
Big Data Processing with Hadoop: A Review (IRJET Journal)
1. This document provides an overview of big data processing with Hadoop. It defines big data and describes the challenges of volume, velocity, variety and variability.
2. Traditional data processing approaches are inadequate for big data due to its scale. Hadoop provides a distributed file system called HDFS and a MapReduce framework to address this.
3. HDFS uses a master-slave architecture with a NameNode and DataNodes to store and retrieve file blocks. MapReduce allows distributed processing of large datasets across clusters through mapping and reducing functions.
The document describes an automatic text summarization system that uses both extractive and abstractive summarization techniques. It discusses pre-processing steps like stop word removal and sentence filtering. Key features for ranking sentences include title matching, term frequency-inverse document frequency (TF-IDF), and the position of the sentence. The proposed system generates a summary by selecting the highest-ranked sentences while maintaining overall meaning and readability. Areas for future improvement include enhancing the stemming process and exploring additional ranking features.
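A compact sketch of the extractive side of such a system (sentence scoring by average TF-IDF weight and selection of the top-ranked sentences); this is a generic illustration and omits the title-matching and sentence-position features mentioned above.

```python
# Extractive summarization sketch: score sentences by mean TF-IDF weight
# and keep the top-ranked ones (title/position features omitted here).
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

text = ("Hadoop stores data across clusters. It replicates blocks for "
        "reliability. The weather was pleasant yesterday. MapReduce "
        "processes the blocks in parallel across the cluster nodes.")
sentences = [s.strip() for s in text.split(".") if s.strip()]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
scores = np.asarray(tfidf.mean(axis=1)).ravel()   # average weight per sentence

top = scores.argsort()[::-1][:2]                  # pick the 2 best sentences
summary = ". ".join(sentences[i] for i in sorted(top)) + "."
print(summary)
```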
This document discusses using Hadoop to parse and process resumes received in various formats like documents and text. It proposes a system that can extract data from large volumes of resumes automatically without human involvement. Key components include resume parsing using Apache Tika, storing resumes in HDFS, identifying skills and tags using MapReduce, and recommending jobs to candidates based on their profiles. The system aims to reduce manual handling of resumes and provide suggestions to users about missing information in an efficient distributed manner using Hadoop.
Data mining model for the data retrieval from central server configuration (ijcsit)
A server that has to keep track of heavy document traffic is unable to filter out the documents that are most relevant and most recently updated for continuous text search queries. This paper focuses on handling continuous text extraction while sustaining high document traffic. The main objective is to retrieve recently updated documents that are most relevant to the query by applying a sliding-window technique. Our solution indexes the streamed documents in main memory with a structure based on the principles of the inverted file, and processes document arrival and expiration events with an incremental threshold-based method. It also eliminates duplicate document retrieval using unsupervised duplicate detection. The documents are ranked based on user feedback and given higher priority for retrieval.
IRJET - Resume Information Extraction Framework (IRJET Journal)
The document discusses a framework for extracting information from resumes. Resumes are semi-structured documents that contain varying information like different fields, field names, and formats, making them difficult to parse. The proposed framework uses text mining and rule-based parsing to extract keywords from resumes, scores qualifications and skills, clusters the extracted information using DBSCAN, and classifies the resumes using gradient boosting machines. It aims to help recruiters filter and categorize large numbers of resumes more efficiently.
A novel approach for text extraction using effective pattern matching technique (eSAT Journals)
This document presents a novel approach for effective pattern matching in text mining. It discusses the limitations of existing term-based approaches, which suffer from problems like polysemy and synonymy. The proposed technique uses four processes - pattern deploying, pattern evolving, shuffling and offset refinement - to discover patterns from text documents. It evaluates patterns according to their distribution in documents and reduces the influence of ambiguous patterns. Experimental results show the proposed model outperforms other data mining methods by achieving a higher performance level for text mining tasks.
IRJET - Towards Efficient Framework for Semantic Query Search Engine in Large-... (IRJET Journal)
The document proposes a new framework for efficient semantic search in large datasets. It aims to improve understanding of short texts by enriching them with concepts and related terms from a probabilistic knowledge base. A deep learning model using stacked autoencoders is designed to learn features from the enriched short texts and encode them into binary codes, allowing similarity searches. Experiments show the new approach captures semantics better than existing methods and enables applications like short text retrieval and classification.
This document summarizes an article on automatic document clustering. It discusses how documents are preprocessed using techniques like tokenization, stop word removal, and stemming before being clustered. Clustering algorithms group similar documents together based on similarity metrics like TF-IDF. Documents can be searched by name, extension, or keywords to retrieve them from the clustered storage in a faster and more organized manner. Future work may include clustering multimedia files and improving algorithm performance.
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop (IJTET Journal)
This document discusses approaches for mining frequent item sets on Apache Hadoop. It begins with an introduction to data mining and association rule mining. Association rule mining involves finding frequent item sets, which are items that frequently occur together. Apache Hadoop is then introduced as a framework for distributed processing of large datasets. Several algorithms for mining frequent item sets are discussed, including Apriori, FP-Growth, and H-mine. These algorithms differ in how they generate and count candidate item sets. The document then discusses how these algorithms can be implemented on Hadoop to take advantage of its distributed and parallel processing abilities in order to efficiently mine frequent item sets from large datasets.
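To make the candidate-generation-and-count idea behind Apriori concrete, here is a small single-machine sketch; the Hadoop variants surveyed in the document partition this support counting across mappers and reducers, which this toy version does not do.

```python
# Tiny Apriori-style frequent itemset miner (single machine, illustrative;
# distributed Hadoop versions spread the support counting across nodes).
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
min_support = 2  # minimum number of transactions an itemset must appear in

def frequent_itemsets(transactions, min_support):
    items = {i for t in transactions for i in t}
    frequent, k = {}, 1
    current = [frozenset([i]) for i in sorted(items)]
    while current:
        # Count the support of each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join surviving k-itemsets to generate (k+1)-item candidates.
        keys = sorted(survivors, key=sorted)
        current = sorted({a | b for a, b in combinations(keys, 2)
                          if len(a | b) == k + 1}, key=sorted)
        k += 1
    return frequent

result = frequent_itemsets(transactions, min_support)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```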
Big data is a popular term used to describe large volumes of data that include structured, semi-structured and unstructured data. Nowadays, unstructured data is growing at an explosive speed with the development of the Internet and social networks such as Twitter, Facebook and Yahoo. In order to process such a colossal amount of data, software that can do so efficiently is required, and this is where Hadoop steps in. Hadoop has become one of the most widely used frameworks for dealing with big data; it is used to analyze and process it. In this paper, Apache Flume is configured and integrated with Spark Streaming to stream data from the Twitter application. The streamed data is stored in Apache Cassandra. After retrieving the data, it is analyzed using Apache Zeppelin. The result is displayed on a dashboard, and the dashboard result is also analyzed and validated using JSON.
An Efficient Approach to Manage Small Files in Distributed File Systems (IRJET Journal)
This document proposes an efficient approach to manage small files in distributed file systems. It discusses challenges in storing and retrieving a large number of small files due to their metadata size. The proposed system aims to improve performance by designing a new metadata structure that decreases original metadata size and increases file access speed. It presents a system architecture with clients, metadata server and data servers. The system classifies files into different storage types and combines existing techniques into a single architecture. It provides algorithms for reading local metadata and file data in two phases. The performance of the approach is evaluated on a distributed file system environment.
IRJET - Automated Document Summarization and Classification using Deep Lear... (IRJET Journal)
The document proposes a system that uses deep learning methods for automated document summarization and classification. It uses a recurrent convolutional neural network (RCNN) which combines a convolutional neural network and recurrent neural network to build a robust classifier model. For summarization, it employs a graph-based method inspired by PageRank to extract the top 20% of sentences from a document based on word intersections. The RCNN model achieved over 97% accuracy on classifying documents from various domains using their summaries. The system aims to speed up classification and make it more intuitive using automated summarization techniques with deep learning.
Information Upload and retrieval using SP Theory of Intelligence (INFOGAIN PUBLICATION)
In today's technology landscape, cloud computing has become an important aspect, and storing data in the cloud is of high importance as the need for virtual space to store massive amounts of data has grown over the years. However, the time taken for uploading and downloading is limited by processing time, so this issue needs to be solved in order to handle large data and its processing. Another common problem is deduplication. As cloud services grow at a rapid rate, increasingly large volumes of data are stored on remote cloud servers, but many of the remotely stored files are duplicated because the same file is uploaded by different users at different locations. A recent survey by EMC says that about 75% of the digital data present in the cloud are duplicate copies. To overcome these two problems, in this paper we use the SP theory of intelligence with lossless compression of information, which makes the big data smaller and thus reduces the problems in storing and managing large amounts of data.
These days we have an increased number of heart diseases including increased risk of heart attacks. Our proposed system users sensors that allow to detect heart rate of a person using heartbeat sensing even if the person is at home. The sensor is then interfaced to a microcontroller that allows checking heart rate readings and transmitting them over internet. The user may set the high as well as low levels of heart beat limit. After setting these limits, the system starts monitoring and as soon as patient heart beat goes above a certain limit, the system sends an alert to the controller which then transmits this over the internet and alerts the doctors as well as concerned users. Also the system alerts for lower heartbeats. Whenever the user logs on for monitoring, the system also displays the live heart rate of the patient. Thus concerned ones may monitor heart rate as well get an alert of heart attack to the patient immediately from anywhere and the person can be saved on time.This value will continue to grow if no proper solution is found. Internet of Things (IoT) technology developments allows humans to control a variety of high-tech equipment in our daily lives. One of these is the ease of checking health using gadgets, either a phone, tablet or laptop. we mainly focused on the safety measures for both driver and vehicle by using three types of sensors: Heartbeat sensor, Traffic light sensor and Level sensor. Heartbeat sensor is used to monitor heartbeat rate of the driver constantly and prevents from the accidents by controlling through IOT.
The success of the cloud computing paradigm is due to its on-demand, self-service, and pay-by-use nature. Public key encryption with keyword search applies only to the circumstance that keyword ciphertext can be retrieved only by a specific user, and it supports only single-keyword matching. In the existing searchable encryption schemes, either the communication mode is one-to-one, or only single-keyword search is supported. This paper proposes a searchable encryption scheme that is based on attributes and supports multi-keyword search. Searchable encryption is a primitive which not only protects the data privacy of data owners but also enables data users to search over the encrypted data. Most existing searchable encryption schemes are in the single-user setting; there are only a few schemes in the multiple-data-user setting, i.e., encrypted data sharing. Among these schemes, most of the early techniques depend on a trusted third party with interactive search protocols or need cumbersome key management. To remedy these defects, the most recent approaches borrow ideas from attribute-based encryption to enable attribute-based keyword search (ABKS).
This document reviews the behavior of reinforced concrete deep beams. Deep beams are defined as having a shear span to depth ratio of less than 5. The response of deep beams differs from regular beams due to the influence of shear deformations and stresses. Failure modes include flexure, flexural-shear, and diagonal cracking. Previous studies investigated factors affecting shear strength such as concrete strength, reinforcement, and loading conditions. Equations have been proposed to predict shear strength based on test results.
Subcutaneous administration of toluene to rabbits for 6 weeks resulted in significant increases in liver enzyme levels and histopathological changes in the liver tissue. Liver sections from toluene-treated rabbits showed congested central veins, flattening and vacuolation of hepatocytes, and disarrangement of hepatic architecture. In contrast, liver sections from control rabbits appeared normal. Toluene exposure is known to cause oxidative stress and damage cell membranes in the liver through its metabolism.
This document summarizes a research paper that proposes a system to analyze crop phenology (growth stages) using IoT to support parallel agriculture management. The system would use sensors to collect data on soil moisture, temperature, humidity and other parameters. This data would be input to a database. Then, a multiple linear regression model trained on past data would predict the optimal crop and expected yield based on the tested sensor data and parameters. This system aims to help farmers select crops and fertilization practices tailored to their specific fields' conditions.
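A minimal sketch of the multiple-linear-regression step is shown below, using scikit-learn as an assumed stand-in; the feature set and sample values are illustrative only.

from sklearn.linear_model import LinearRegression

# Historical records: [soil_moisture_%, temperature_C, humidity_%] -> yield (t/ha), illustrative numbers.
X_train = [[32, 28, 60], [45, 30, 70], [28, 26, 55], [50, 31, 75]]
y_train = [2.1, 3.0, 1.8, 3.4]

model = LinearRegression().fit(X_train, y_train)

# A new sensor reading from the field.
reading = [[40, 29, 65]]
print("Predicted yield (t/ha):", model.predict(reading)[0])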
This document summarizes a study that determined the liberation size of gold ore from the Iperindo-Ilesha deposit in Nigeria and assessed its amenability to froth flotation. Samples of the ore were collected and subjected to sieve analysis to determine particle size fractions. Chemical analysis found that the actual and economic liberation sizes were 45μm and 250μm, respectively. Froth flotation experiments at 45μm particle size and varying collector dosages achieved a maximum gold recovery of 78.93% at 0.3 mol/dm3 collector dosage, with concentrate grade of 115 ppm Au. These parameters will be used for further processing to extract gold from this deposit.
This document presents a proposal for an IOT-based intelligent baby care system with a web application for remote baby monitoring. The system uses sensors to automatically swing a cradle when a baby cries, sound alarms if the baby cries for too long or the mattress is wet, and sends alerts to a web page for parents to monitor the baby's status from anywhere via internet connection. The proposed system aims to help working parents manage childcare remotely using sensors, a Raspberry Pi, web camera, and cloud server to detect the baby's activities and notify parents through a web application on their phone.
This document discusses various sources of water pollution and new techniques being developed for water purification. It begins by outlining how water pollution occurs from industrial wastes like mining and manufacturing, agricultural runoff containing pesticides, and domestic waste. It then examines some specific pollutants in more depth from these sources. New techniques under research for water purification are also mentioned, with the goal of developing more affordable methods. The document aims to analyze the impact of pollutants on water and introduce promising new purification techniques.
This document summarizes a research paper on using big data methodologies with IoT and its applications. It discusses how big data analytics is being used across various fields like engineering, data management, and more. It also discusses how IoT enables the collection of massive amounts of data from sensors and devices. Machine learning techniques are used to analyze this big data from IoT and enable communication between devices. The document provides examples of domains where big data and IoT are being applied, such as healthcare, energy, transportation, and others. It analyzes the similarities and differences in how big data techniques are used across these IoT domains.
The document describes a proposed smart library automation and monitoring system using RFID technology. The system uses RFID tags attached to books and student ID cards. An RFID scanner reads the tags to automate processes like tracking student entry and exit, book check-in/check-out, and inventory management. This allows transactions to occur without manual intervention. The system also includes an Android app for students to search books and check availability. The goals are to streamline library operations, prevent unauthorized access, and help locate misplaced books. Raspberry Pi hardware and a MySQL database are part of the proposed implementation.
This document discusses congestion control techniques for vehicular ad hoc networks (VANETs). It first provides background on VANETs, noting their use of vehicle-to-vehicle communication to share information. Congestion can occur when there is a sudden increase in data from nodes in the network. The document then reviews different existing congestion control schemes, which vary in how they adjust source sending rates and handle transient congestion. It proposes a priority-based congestion control technique using dual queues, one for transit packets and one for locally generated packets. This approach aims to route packets along less congested paths when congestion is detected based on buffer occupancy.
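A minimal sketch of the dual-queue idea is given below; the queue capacity, the occupancy threshold and the class structure are assumptions for illustration, not the paper's protocol.

from collections import deque

class DualQueueNode:
    def __init__(self, capacity=50, congestion_threshold=0.8):
        self.transit = deque()   # packets being relayed for other vehicles
        self.local = deque()     # packets generated by this vehicle
        self.capacity = capacity
        self.threshold = congestion_threshold

    def enqueue(self, packet, is_transit):
        (self.transit if is_transit else self.local).append(packet)

    def congested(self):
        # Congestion is flagged when buffer occupancy crosses the threshold.
        occupancy = (len(self.transit) + len(self.local)) / self.capacity
        return occupancy >= self.threshold

    def dequeue(self):
        # Priority: forward transit packets before locally generated ones.
        if self.transit:
            return self.transit.popleft()
        if self.local:
            return self.local.popleft()
        return None

node = DualQueueNode()
node.enqueue("beacon-from-neighbour", is_transit=True)
node.enqueue("own-safety-message", is_transit=False)
print(node.dequeue(), node.congested())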
This document summarizes a research paper that proposes applying principles of Vedic mathematics to optimize the design of multipliers, squarers, and cubers. It begins by providing background on multipliers and their importance in electronic systems. It then reviews related work applying Vedic mathematics to multiplier design. The document outlines the methodology for performing multiplication, squaring, and cubing according to Vedic mathematics principles. It presents simulation and synthesis results comparing the proposed Vedic designs to traditional array-based designs, finding improvements in speed, power, and area. The document concludes that Vedic mathematics provides an effective approach for optimizing the design of these fundamental arithmetic components.
Cloud computing is one of the emerging techniques for processing big data. A large collection, or large volume, of data is known as big data. Processing big data (MRI images and DICOM images) normally takes more time compared with other data. The main tasks involved in handling big data can be solved by using the concepts of Hadoop, and enhancing the Hadoop concept helps the user to process large sets of images or data. The Advanced Hadoop Distributed File System (AHDF) and MapReduce are the two default main functions which are used to enhance Hadoop. The HDFS method is the Hadoop file storage system, which is used for storing and retrieving the data. MapReduce is the combination of two functions, namely map and reduce: map is the process of splitting the inputs, and reduce is the process of integrating the outputs of the map step. Recently, medical fields have experienced problems like machine failure and fault tolerance while processing results for scanned data. A unique optimized time-scheduling algorithm, called the Advanced Dynamic Handover Reduce Function (ADHRF) algorithm, is introduced in the reduce function. Enhancement of Hadoop and the introduction of ADHRF in the cloud help to overcome the processing risks and to obtain optimized results with less waiting time and a reduced error percentage in the output image.
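As a generic illustration of the map and reduce functions described above (and not of the paper's ADHRF algorithm or an actual Hadoop job), the following plain-Python word count shows map splitting the inputs and reduce integrating the map outputs.

from collections import defaultdict

def map_phase(documents):
    # Map: split each input document into (word, 1) pairs.
    for doc in documents:
        for word in doc.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce: integrate the map output by summing the counts per word.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

docs = ["MRI scan of patient", "DICOM image of MRI scan"]
print(reduce_phase(map_phase(docs)))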
Text mining has become one of the trending fields incorporated in several research areas, for example computational linguistics, Information Retrieval (IR) and data mining. Natural Language Processing (NLP) methods are used to extract knowledge from text written by people. Text mining parses unstructured data to deliver meaningful information patterns in the shortest time. Social networking sites are a great source of communication, as most people today use these sites in their daily lives to stay connected with each other. It has become common practice not to write sentences with correct grammar and spelling. This practice may lead to various kinds of ambiguities, such as lexical, syntactic and semantic, and because of such unclear data it is hard to find the real information. Accordingly, we are conducting a survey with the aim of examining the different text mining techniques used to handle various textual queries on social media sites. This review aims to describe how studies of social media have used text analysis and text mining techniques to identify the key topics in the data. The study concentrates on text mining studies related to Facebook and Twitter, the two dominant social media platforms in the world. The results of this survey can serve as baselines for future text mining research.
Colorectal cancer (CRC) has potential to spread within the peritoneal cavity, and this transcoelomic
dissemination is termed “peritoneal metastases” (PM). The aim of this article was to summarise the current
evidence regarding CRC patients at high risk of PM. Colorectal cancer is the second most common cause of cancer
death in the UK. Prompt investigation of suspicious symptoms is important, but there is increasing evidence that
screening for the disease can produce significant reductions in mortality. High-quality surgery is of paramount
importance in achieving good outcomes, particularly in rectal cancer, but adjuvant radiotherapy and chemotherapy
have important parts to play. The treatment of advanced disease is still essentially palliative, although surgery for
limited hepatic metastases may be curative in a small proportion of patients.
This document summarizes a research paper on the thermal performance of air conditioners using nanofluids compared to base fluids. Key points:
- Nanofluids, which are liquids containing nanoparticles, can improve heat transfer in heat pipes and cooling systems due to their higher thermal conductivity compared to base fluids.
- The document reviews how factors like nanofluid type, nanoparticle size and concentration affect thermal efficiency and heat transfer limits. It also examines using nanofluids to enhance heat exchange in transmission fluids.
- An experimental setup is described to study heat transfer and friction factors of water-based Al2O3 nanofluids in a horizontal tube under constant heat flux. Temperature, pressure and flow rate are measured.
Nowadays, the pedal-powered grinding machine is used only for grinding. It also requires a lot of effort and is limited to a single application. Another problem with the existing model is that it consumes more time and has lower efficiency. Our aim is to design a human-powered grinding machine which can also be used for many purposes such as pumping, grinding, washing, and cutting. It can carry water to a height of 8 meters and produce 4 amperes of electricity in the most effective way. The system is also useful for health-conscious workout purposes. The purpose of this technical study is to increase the performance and output capacity of the pedal-powered grinding machine.
This document summarizes a research paper that proposes using distributed control of multiple energy storage units (ESUs) to manage voltage and loading in electric distribution networks with renewable energy sources like solar and wind. The distributed control approach coordinates the ESUs to store excess power generated during peak periods and discharge it during peak load periods. Each ESU can provide both active and reactive power to support voltage and manage power flows. The distributed control strategy uses a consensus algorithm to divide the required active power reduction equally among ESUs based on their available capacity. Simulation results are presented to analyze the coordinated control of ESU active and reactive power outputs over time.
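A minimal sketch of a consensus-averaging step of the kind described above is given below; the line topology, step size and numbers are assumptions, and the real controller would also handle reactive power and capacity limits.

def consensus_average(initial, neighbours, step=0.3, iterations=50):
    # Each ESU repeatedly moves its estimate toward the average of its neighbours,
    # so all units converge to a common (equal) share of the required power.
    values = list(initial)
    n = len(values)
    for _ in range(iterations):
        new_values = values[:]
        for i in range(n):
            new_values[i] += step * sum(values[j] - values[i] for j in neighbours[i]) / max(len(neighbours[i]), 1)
        values = new_values
    return values

# Four ESUs in a line topology; only ESU 0 initially knows the measured surplus (kW).
initial_estimates = [120.0, 0.0, 0.0, 0.0]
topology = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(consensus_average(initial_estimates, topology))  # each value converges toward 30 kW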
The steady increase in non-linear loads on the power supply network, such as AC variable speed drives, DC variable speed drives, UPS, inverters and SMPS, raises issues about power quality and reliability. In this context, attention has been focused on harmonics. Harmonics overload the power system network, cause reliability problems in equipment and systems, and also waste energy. Passive and active harmonic filters are used to mitigate harmonic problems, and the use of both is justified. The difficulty for practicing engineers is to select and deploy the correct harmonic filters. This paper explains which solutions are suitable when choosing active and passive harmonic filters and also explains the mistakes that need to be avoided.
This paper is aimed at analyzing a few important power system equipment failures generally occurring in industrial power distribution systems. Many such general problems, if not resolved, may lead to huge production stoppages and unforeseen equipment damage. We can improve the reliability of the power system by applying a problem-solving tool to every case study: finding the root cause of the problem, validating the root cause, and eliminating it through corrective measures. This problem-solving approach should be practiced every day to improve power system reliability. This paper will throw light on, and be a guide for, practicing electrical engineers to find solutions to the problems they come across in their day-to-day maintenance activity.
More from IJET - International Journal of Engineering and Techniques (20)
Sachpazis_Consolidation Settlement Calculation Program-The Python Code and th...Dr.Costas Sachpazis
Consolidation Settlement Calculation Program-The Python Code
By Professor Dr. Costas Sachpazis, Civil Engineer & Geologist
This program calculates the consolidation settlement for a foundation based on soil layer properties and foundation data. It allows users to input multiple soil layers and foundation characteristics to determine the total settlement.
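The listing below is not Professor Sachpazis's program; it is a minimal sketch of the classical one-dimensional consolidation settlement formula for normally consolidated clay, S = H * Cc / (1 + e0) * log10((p0 + dp) / p0), with layer data assumed for the example.

import math

def consolidation_settlement(layers, delta_p):
    """layers: list of (H_m, Cc, e0, p0_kPa); delta_p: stress increase in kPa."""
    total = 0.0
    for H, Cc, e0, p0 in layers:
        total += H * Cc / (1.0 + e0) * math.log10((p0 + delta_p) / p0)
    return total

# Two example clay layers beneath the foundation (illustrative values only).
layers = [
    (2.0, 0.30, 0.90, 80.0),   # 2 m thick, Cc = 0.30, e0 = 0.90, p0 = 80 kPa
    (3.0, 0.25, 0.80, 120.0),  # 3 m thick, Cc = 0.25, e0 = 0.80, p0 = 120 kPa
]
print("Total settlement (m):", consolidation_settlement(layers, delta_p=50.0))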
Impartiality as per ISO /IEC 17025:2017 StandardMuhammadJazib15
This document provides basic guidelines for the impartiality requirement of ISO 17025. It defines in detail how the requirement is met.
Tools & Techniques for Commissioning and Maintaining PV Systems W-Animations ...Transcat
Join us for this solutions-based webinar on the tools and techniques for commissioning and maintaining PV Systems. In this session, we'll review the process of building and maintaining a solar array, starting with installation and commissioning, then reviewing operations and maintenance of the system. This course will review insulation resistance testing, I-V curve testing, earth-bond continuity, ground resistance testing, performance tests, visual inspections, ground and arc fault testing procedures, and power quality analysis.
Fluke Solar Application Specialist Will White is presenting on this engaging topic:
Will has worked in the renewable energy industry since 2005, first as an installer for a small east coast solar integrator before adding sales, design, and project management to his skillset. In 2022, Will joined Fluke as a solar application specialist, where he supports their renewable energy testing equipment like IV-curve tracers, electrical meters, and thermal imaging cameras. Experienced in wind power, solar thermal, energy storage, and all scales of PV, Will has primarily focused on residential and small commercial systems. He is passionate about implementing high-quality, code-compliant installation techniques.
Determination of Equivalent Circuit parameters and performance characteristic...pvpriya2
Includes the testing of an induction motor to draw its circle diagram, with a step-wise procedure and the corresponding calculations. Also explains the working and applications of the induction generator.
Accident detection system project report.pdfKamal Acharya
The rapid growth of technology and infrastructure has made our lives easier. The advent of technology has also increased traffic hazards, and road accidents take place frequently, causing huge loss of life and property because of poor emergency facilities. Many lives could have been saved if emergency services could get accident information and reach the scene in time. Our project provides an optimum solution to this drawback. A piezoelectric sensor can be used as a crash or rollover detector of the vehicle during and after a crash. With signals from a piezoelectric sensor, a severe accident can be recognized. According to this project, when a vehicle meets with an accident or rolls over, the piezoelectric sensor immediately detects the signal. Then, with the help of a GSM module and a GPS module, the location is sent to the emergency contact. After confirming the location, the necessary action is taken. If the person meets with a small accident or there is no serious threat to anyone's life, the alert message can be terminated by the driver using a switch provided, in order to avoid wasting the valuable time of the medical rescue team.
Sri Guru Hargobind Ji - Bandi Chor Guru.pdfBalvir Singh
Sri Guru Hargobind Ji (19 June 1595 - 3 March 1644) is revered as the Sixth Nanak.
• On 25 May 1606 Guru Arjan nominated his son Sri Hargobind Ji as his successor. Shortly
afterwards, Guru Arjan was arrested, tortured and killed by order of the Mogul Emperor
Jahangir.
• Guru Hargobind's succession ceremony took place on 24 June 1606. He was barely
eleven years old when he became 6th Guru.
• As ordered by Guru Arjan Dev Ji, he put on two swords, one indicated his spiritual
authority (PIRI) and the other, his temporal authority (MIRI). He thus for the first time
initiated military tradition in the Sikh faith to resist religious persecution, protect
people’s freedom and independence to practice religion by choice. He transformed
Sikhs to be Saints and Soldier.
• He had a long tenure as Guru, lasting 37 years, 9 months and 3 days
A high-Speed Communication System is based on the Design of a Bi-NoC Router, ...DharmaBanothu
The Network on Chip (NoC) has emerged as an effective
solution for intercommunication infrastructure within System on
Chip (SoC) designs, overcoming the limitations of traditional
methods that face significant bottlenecks. However, the complexity
of NoC design presents numerous challenges related to
performance metrics such as scalability, latency, power
consumption, and signal integrity. This project addresses the
issues within the router's memory unit and proposes an enhanced
memory structure. To achieve efficient data transfer, FIFO buffers
are implemented in distributed RAM and virtual channels for
FPGA-based NoC. The project introduces advanced FIFO-based
memory units within the NoC router, assessing their performance
in a Bi-directional NoC (Bi-NoC) configuration. The primary
objective is to reduce the router's workload while enhancing the
FIFO internal structure. To further improve data transfer speed,
a Bi-NoC with a self-configurable intercommunication channel is
suggested. Simulation and synthesis results demonstrate
guaranteed throughput, predictable latency, and equitable
network access, showing significant improvement over previous
designs.
International Journal of Engineering and Techniques - Volume 2 Issue 3, May – June 2016, ISSN: 2395-1303, http://www.ijetjournal.org
Automatic Text Categorization on News Articles
Muthe Sandhya, Shitole Sarika, Sinha Anukriti, Aghav Sushila
(Department of Computer, MITCOE, Pune, India)
INTRODUCTION
As we know, news is a vital source for keeping man
aware of what is happening around the world. They keep
a person well informed and educated. The availability of
news on the internet has expanded the reach, availability
and ease of reading daily news on the web. News on the
web is quickly updated and there is no need to wait for a
printed version in the early morning to get updated on
the news. The booming news industry (especially online
news) has spoilt users for choices. The volume of
information available on the internet and corporate
intranets continues to increase. So, there is a need for text categorization.
Many text categorization approaches exist, such as neural networks, decision trees, etc., but the goal of this paper is to use k-means successfully for text categorization. In the past, work has been done on this topic by Susan Dumais and her team at Microsoft Research (One Microsoft Way, Redmond). They used inductive learning algorithms like naive Bayes and SVM for text categorization. In 2002, Tan et al. used bigrams for text categorization successfully. In 2012, Xiqing Zhao and Lijun Wang of HeBei North University, Zhangjiakou, China improved the KNN (k-nearest neighbour) algorithm to categorize data into groups. The current use of TF-IDF (term frequency-inverse document frequency) as a vector space model has increased the scope of the text categorization scenario. In this paper, we are using a dictionary which gives appropriate results with k-means. This paper shows the methods which give accurate results.
Text categorization is nothing but the process of categorizing text documents on the basis of words and phrases with respect to categories which are predefined.
The flow of the paper is as follows: the 1st section gives the article categorization, the 2nd section gives the applications, the 3rd gives the conclusion, and the 4th gives the references for the paper.
1. Article categorization:-
Article categorization mainly includes five steps. Firstly we have to gather news articles, secondly preprocessing is done, then we have to apply TF-IDF, after that k-means, and then a user interface is created in which the news articles are listed. The following figure gives the steps of article categorization.
Abstract:
Text categorization is a term that has intrigued researchers for quite some time now. It is the concept
in which news articles are categorized into specific groups to cut down efforts put in manually categorizing
news articles into particular groups. A growing number of statistical classification and machine learning
technique have been applied to text categorization. This paper is based on the automatic text categorization
of news articles based on clustering using k-mean algorithm. The goal of this paper is to automatically
categorize news articles into groups. Our paper mostly concentrates on K-mean for clustering and for term
frequency we are going to use TF-IDF dictionary is applied for categorization. This is done using mahaout
as platform.
Keywords — Text categorization, Preprocessing, K-mean, TF-IDF etc.
A. Document:-
The news articles are extracted from various websites like Yahoo, Reuters, CNN, etc. For our project we use RSS for extraction. RSS is a protocol for providing news summaries in XML format over the internet. The extracted news is stored in HDFS (Hadoop Distributed File System), which is used to store large amounts of data.
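A minimal sketch of the RSS extraction step is shown below, using the feedparser library as an assumption (the paper only states that RSS is used); the feed URL is an example and the HDFS storage step is omitted.

import feedparser

def fetch_headlines(feed_url):
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries:
        # Keep the title and summary of each news item for later preprocessing.
        articles.append({"title": entry.get("title", ""), "summary": entry.get("summary", "")})
    return articles

news = fetch_headlines("http://feeds.reuters.com/reuters/topNews")  # example URL
print(len(news), "articles fetched")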
B. Preprocessing:-
As a large amount of data is stored in HDFS, it is necessary to process the data for fast and accurate results. Some modifications are done on the text so that it is best suited for feature extraction. In this step, stop word removal and stemming of the data are done.
a. Stop word removal:-
Words like 'a', 'an' and 'the', which increase the volume of data but are of no actual use, are removed from the data. The process of removing these unused words is known as stop word removal.
b. Stemming:-
Stemming is the process of reducing inflected words to their word stem or root. The root of the word is found by this technique, e.g. 'running' is reduced to 'run'.
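A minimal sketch of these two preprocessing steps is given below; the NLTK Porter stemmer and the small stop word list are assumptions used for illustration, since the actual project relies on Mahout's text processing.

import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "of", "is", "in", "on", "and", "to"}  # illustrative subset
stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, tokenize, drop stop words, then stem each remaining token.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The runner is running in the marathon"))
# -> ['runner', 'run', 'marathon']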
C. TF-IDF:-
TF-IDF stands for term frequency-inverse document frequency. This measure is used to predict the importance of a word in a specific document. In this technique, a word which appears frequently in a document gets a high score, while words which appear frequently across many documents get a low score. So, we are going to do preprocessing before TF-IDF, as it removes 'the', 'an', 'a', etc.; even if we ignored the preprocessing, words like 'a', 'an' and 'the' would occur in many documents and hence get a low score.
Basically, tf-idf is the product of term frequency (tf) and inverse document frequency (idf). Mathematically it is represented as:
tf-idf(t, d, D) = tf(t, d) * idf(t, D)
where tf(t, d) is the frequency of term t in document d, and idf(t, D) measures the importance of term t across all documents D.
Term frequency:
tf(t, d) = 0.5 + 0.5 * f(t, d) / (maximum frequency of any word in d)
Inverse document frequency:
idf(t, D) = log[(no. of documents) / (no. of documents in which term t appears)]
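The sketch below computes tf-idf with exactly the augmented term frequency and idf formulas given above; it is illustrative Python, not the Mahout implementation used in the project.

import math
from collections import Counter

def tf(term, doc_tokens):
    # Augmented term frequency: 0.5 + 0.5 * f(t,d) / max frequency in the document.
    counts = Counter(doc_tokens)
    return 0.5 + 0.5 * counts[term] / max(counts.values())

def idf(term, all_docs):
    # log(total documents / documents containing the term).
    containing = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / containing) if containing else 0.0

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

docs = [["election", "result", "news"],
        ["cricket", "match", "news"],
        ["election", "campaign", "rally"]]
print(tf_idf("election", docs[0], docs))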
D. K-mean:-
K-means is the simplest algorithm which is used mainly for clustering. In our project we used Mahout as the platform, which has its own k-means driver.
Algorithm:-
1) Randomly select 'c' cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from the data point is the minimum over all cluster centers.
4) Recalculate each new cluster center using:
v_i = (x_1 + x_2 + ... + x_n) / c_i
where the x's are the data points assigned to the i-th cluster and 'c_i' represents the number of data points in the i-th cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3.
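A plain-Python sketch of the k-means steps listed above is given below; the project itself runs Mahout's k-means driver on TF-IDF vectors, so the tiny 2-D points and random initialization here are only for illustration.

import math
import random

def kmeans(points, k, max_iterations=100):
    centers = random.sample(points, k)                       # step 1: random initial centers
    assignment = [None] * len(points)
    for _ in range(max_iterations):
        changed = False
        for i, p in enumerate(points):                       # steps 2-3: assign to nearest center
            distances = [math.dist(p, c) for c in centers]
            best = distances.index(min(distances))
            if assignment[i] != best:
                assignment[i] = best
                changed = True
        for j in range(k):                                   # step 4: recompute each center
            members = [points[i] for i in range(len(points)) if assignment[i] == j]
            if members:
                centers[j] = tuple(sum(coords) / len(members) for coords in zip(*members))
        if not changed:                                      # step 6: stop when nothing was reassigned
            break
    return centers, assignment

pts = [(1, 1), (1, 2), (8, 8), (9, 8), (0, 1), (9, 9)]
print(kmeans(pts, k=2))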
E. User interface:-
Lastly, the user interface is created, which is an XML document. It is the final outcome of the article categorization, once clustering has been performed. In simple terms, clustering is nothing but the extraction of meaningful/useful articles from a large number of articles.
RESULT
This project uses Apache-based framework support in the form of the Hadoop ecosystem, mainly HDFS (Hadoop Distributed File System) and Apache Mahout. We have implemented the project in the NetBeans IDE using the Mahout libraries for k-means and TF-IDF. The code has been deployed on a Tomcat server.
Fig. 1. The Flow Diagram of the Project.
As discussed earlier, the database is extracted from the internet using RSS. The data is stored in HDFS, and then stop word removal and stemming are applied to increase the efficiency of the data. TF-IDF has been used in the project to understand the context of an article by weighting the most frequently occurring words as well as the rarest words present in a document. As we know, k-means does not work on words; it only deals with numbers. TF-IDF as a vector space model helps convert words to a number format which can be given as input to the k-means algorithm in the form of a sequence file with key-value pairs. The k-means algorithm gives us the news in clustered form.
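As a compact stand-in for this pipeline (assuming scikit-learn in place of the Mahout/HDFS stack actually used), the sketch below vectorizes a tiny corpus with TF-IDF and clusters it with k-means.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "Election results announced as the ruling party wins key states",
    "Cricket team clinches the series after a thrilling final match",
    "Opposition challenges election outcome in court",
    "Star batsman scores a century in the deciding cricket match",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(articles)  # words -> numbers
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, article in zip(labels, articles):
    print(label, article[:50])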
Fig. 2. Login Page
The figure shows the login page. The user can log in here if he has registered himself. If the user does not have an account, then he has to click on 'register here' to register himself.
Fig. 3. Extracted News
The figure shows the news extracted from different websites, which is stored on HDFS.
Fig. 4. Clustering Process
Fig. 4 above shows the clustering process. The weights are calculated and the sequence file is generated using TF-IDF.
Fig. 5. Final Cluster Data
Fig. 5 above shows the clustered data, listing the different documents with their IDs in the same cluster.
APPLICATION
2.1. Automatic indexing for Boolean information
retrieval systems
The application that has stimulated the research
in text categorization from its very beginning, back in
the ’60s, up until the ’80s, is that of automatic indexing
of scientific articles by means of a controlled dictionary,
such as the ACM Classification Scheme, where the
categories are the entries of the controlled dictionary.
This is typically a multi-label task, since several index
terms are usually assigned to each document.
Automatic indexing with controlled dictionaries
is closely related to the automated metadata generation
task. In digital libraries one is usually interested in
tagging documents by metadata that describe them under
a variety of aspects (e.g. creation date, document type or
format, availability, etc.). Some of these metadata are
thematic, i.e. their role is to describe the semantics of the
document by means of bibliographic codes, keywords or
key-phrases. The generation of these metadata may thus
be viewed as a problem of document indexing with
controlled dictionary, and thus tackled by means of text
categorization techniques. In the case of Web
documents, metadata describing them will be needed for
the Semantic Web to become a reality, and text
categorization techniques applied to Web data may be
envisaged as contributing part of the solution to the huge
problem of generating the metadata needed by Semantic
Web resources.
2.2. Document Organization
Indexing with a controlled vocabulary is an
instance of the general problem of document base
organization. In general, many other issues pertaining to
document organization and filing, be it for purposes of
personal organization or structuring of a corporate
document base, may be addressed by text categorization
techniques. For instance, at the offices of a newspaper, it
might be necessary to classify all past articles in order to
ease future retrieval in the case of new events related to
the ones described by the past articles. Possible
categories might be Home News, International, Money,
Lifestyles, Fashion, but also finer-grained ones such as
The Prime Minister's USA visit.
Another possible application in the same range
is the organization of patents into categories for making
later access easier, and of patent applications for
allowing patent officers to discover possible prior work
on the same topic. This application, as all applications
having to do with patent data, introduces specific
problems, since the description of the allegedly novel
technique, which is written by the patent applicant, may
intentionally use non standard vocabulary in order to
create the impression that the technique is indeed novel.
This use of non standard vocabulary may depress the
performance of a text classifier, since the assumption
that underlies practically all text categorization work is
that training documents and test documents are drawn
from the same word distribution.
2.3. Text filtering
Text filtering is the activity of classifying a
stream of incoming documents dispatched in an
asynchronous way by an information producer to an
information consumer. Typical cases of filtering systems
are e-mail filters (in which case the producer is actually
a multiplicity of producers), newsfeed filters, or filters of
unsuitable content. A filtering system should block the
delivery of the documents the consumer is likely not
interested in. Filtering is a case of binary text
categorization, since it involves the classification of
incoming documents in two disjoint categories, the
relevant and the irrelevant. Additionally, a filtering
system may also further classify the documents deemed
relevant to the consumer into thematic categories of
interest to the user. A filtering system may be installed
at the producer end, in which case it must route the
documents to the interested consumers only, or at the
consumer end, in which case it must block the delivery
of documents deemed uninteresting to the consumer.
In information science document filtering has a
tradition dating back to the ’60s, when, addressed by
systems of various degrees of automation and dealing
with the multi-consumer case discussed above, it was
called selective dissemination of information or current
awareness. The explosion in the availability of digital
information has boosted the importance of such systems,
which are nowadays being used in diverse contexts such
as the creation of personalized Web newspapers, junk e-
mail blocking, and Usenet news selection.
2.4. Word Sense Disambiguation
Word sense disambiguation (WSD) is the
activity of finding, given the occurrence in a text of an
ambiguous (i.e. polysemous or homonymous) word, the
sense of this particular word occurrence. For instance,
bank may have (at least) two different senses in English,
as in the Bank of Scotland (a financial institution) or the
bank of river Thames (a hydraulic engineering artifact).
It is thus a WSD task to decide which of the
above senses the occurrence of bank in Yesterday I
withdrew some money from the bank has. WSD may be
seen as a (single-label) TC task once, given a word w,
we view the contexts of occurrence of w as documents
and the senses of w as categories.
2.5. Hierarchical Categorization of Web Pages
Text categorization has recently aroused a lot of
interest for its possible use to automatically classifying
Web pages, or sites, under the hierarchical catalogues
hosted by popular Internet portals. When Web
documents are catalogued in this way, rather than
issuing a query to a general-purpose Web search engine
a searcher may find it easier to first navigate in the
hierarchy of categories and then restrict a search to a
particular category of interest. Classifying Web pages
automatically has obvious advantages, since the manual
categorization of a large enough subset of the Web is not
feasible. With respect to previously discussed text
categorization applications, automatic Webpage
categorization has two essential peculiarities, namely
the hypertextual nature of the documents, and
the typically hierarchical structure of the category set.
2.6. Automated Survey coding
Survey coding is the task of assigning a
symbolic code from a predefined set of such codes to the
answer that a person has given in response to an open-
ended question in a questionnaire (aka survey). This task
is usually carried out in order to group respondents
according to a predefined scheme based on their
answers. Survey coding has several applications,
especially in the social sciences, where the classification
of respondents is functional to the extraction of statistics
on political opinions, health and lifestyle habits,
customer satisfaction, brand fidelity, and patient
satisfaction.
Survey coding is a difficult task, since the code
that should be attributed to a respondent based on the
answer given is a matter of subjective judgment, and
thus requires expertise. The problem can be seen as a
single-label text categorization problem, where the
answers play the role of the documents, and the codes
that are applicable to the answers returned to a given
question play the role of the categories (different
questions thus correspond to different text categorization
problems).
CONCLUSION
Article categorization is a very critical task, as every method gives a different result. There are many methods like neural networks, k-medoids, etc., but we are using k-means and TF-IDF, which increases the accuracy of the result. It also clusters the articles properly.
In future we will try to solve article categorization for data obtained by using a web crawler, thus creating a one-stop solution for news.
Also, sophisticated text classifiers are not yet available
which if developed would be useful for several
governmental and commercial works. Incremental text
classification, multi-topic text classification, discovering
the presence and contextual use of newly evolving terms
on blogs etc. are some areas where future work can be
done.
REFERENCES
[1] Pankaj Jajoo, Document Clustering, IITR.
[2] Noam Slonim and Naftali Tishby, “The Power of Word Clusters for Text Classification”, School of Computer Science and Engineering and The Interdisciplinary Center for Neural Computation, The Hebrew University, Jerusalem 91904, Israel.
[3] Sean Owen, Robin Anil, Ted Dunning and Ellen Friedman, Mahout in Action, Manning Publications, 2012 edition.
[4] Saleh Alsaleem, Automated Arabic Text Categorization Using SVM and NB, June 2011.
[5] Wen Zang and Xijin Tang, TFIDF, LSI and Multi-word in Information Retrieval and Text Categorization, Nov 2008.
[6] R.S. Zhou and Z.J. Wang, A Review of a Text Classification Technique: K-Nearest Neighbor, CISIA 2015.
[7] Fabrizio Sebastiani, Machine Learning in Automated Text Categorization.
[8] Susan Dumais, David Heckerman and Mehran Sahami, Inductive Learning Algorithms and Representations for Text Categorization.
[9] Ron Bekkerman and James Allan, Using Bigrams in Text Categorization, Dec 2013.
[10] Lodhi, J. Shawe-Taylor, N. Cristianini, and C.J.C.H. Watkins, Text Classification Using String Kernels, in Advances in Neural Information Processing Systems (NIPS), pages 563–569, 2000.