International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
SEMANTIC INTEGRATION FOR AUTOMATIC ONTOLOGY MAPPING cscpconf
In the last decade, ontologies have played a key role as a technology for information sharing and agent interoperability in different application domains. In the semantic web domain, ontologies are used to face the great challenge of representing the semantics of data, in order to bring the current web to its full power and hence achieve its objective. However, using ontologies as common and shared vocabularies requires a certain degree of interoperability between them. To meet this requirement, ontology mapping is an unavoidable solution. Indeed, ontology mapping builds a meta layer that allows different applications and information systems to access and share their information, after resolving the different forms of syntactic, semantic and lexical mismatches. In the contribution presented in this paper, we have integrated the semantic aspect, based on an external lexical resource, WordNet, to design a new algorithm for fully automatic ontology mapping. This fully automatic character is the main difference between our contribution and most of the existing semi-automatic algorithms for ontology mapping, such as Chimaera, Prompt, Onion and Glue. To further enhance the performance of our algorithm, the mapping discovery stage combines two sub-modules: the former analyses the concepts' names and the latter analyses their properties. Each of these two sub-modules is itself based on a combination of lexical and semantic similarity measures.
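The closing sentences describe blending lexical and semantic similarity measures when matching concepts. As an illustration only (the helper names, the 0.5 weighting and the externally supplied semantic score below are assumptions, not the authors' actual algorithm), a lexical name-similarity score can be combined with a semantic one like this:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def lexical_sim(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical names."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b))

def combined_sim(a: str, b: str, semantic_sim: float, w: float = 0.5) -> float:
    """Weighted blend of name similarity and an external semantic score
    (e.g. one derived from WordNet relations); w is an assumed weight."""
    return w * lexical_sim(a, b) + (1 - w) * semantic_sim

print(round(lexical_sim("Author", "Authors"), 3))
```

The point of the blend is that two concepts with dissimilar names (so a low lexical score) can still match strongly if WordNet relates their meanings.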
Taxonomy extraction from automotive natural language requirements using unsup...ijnlc
In this paper we present a novel approach to semi-automatically learn concept hierarchies from natural language requirements of the automotive industry. The approach is based on the distributional hypothesis and the special characteristics of domain-specific German compounds. We extract taxonomies by using clustering techniques in combination with general thesauri. Such a taxonomy can be used to support requirements engineering in early stages by providing a common system understanding and an agreed-upon terminology. This work is part of an ontology-driven requirements engineering process, which builds on top of the taxonomy. Evaluation shows that this taxonomy extraction approach outperforms common hierarchical clustering techniques.
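The abstract above pairs the distributional hypothesis (terms with similar contexts have similar meanings) with clustering to build a taxonomy. A minimal sketch of that idea, assuming toy requirement sentences and a greedy single-link merge; the thesaurus integration and German compound handling of the actual approach are omitted:

```python
from collections import Counter
from math import sqrt

def context_vector(term, sentences, window=2):
    """Distributional profile: counts of words co-occurring with `term`."""
    vec = Counter()
    for sent in sentences:
        words = sent.lower().split()
        for i, w in enumerate(words):
            if w == term:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(x for x in words[lo:hi] if x != term)
    return vec

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(terms, sentences, threshold=0.3):
    """Greedy single-link merging: repeatedly join the two most similar
    clusters until no pair exceeds `threshold` (an illustrative cutoff)."""
    vecs = {t: context_vector(t, sentences) for t in terms}
    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(cosine(vecs[a], vecs[b])
                        for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if pair is None:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters

sentences = [
    "the brake pedal must actuate the brake light",
    "the clutch pedal must disengage the clutch",
    "the brake light must turn on within 50 ms",
]
print(agglomerate(["brake", "clutch", "light"], sentences))
```

Each resulting cluster would then become a candidate node in the taxonomy, with a shared hypernym looked up in a thesaurus.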
Text document clustering and similarity detection is a major part of document management, where every document should be identified by its key terms and domain knowledge. Based on similarity, the documents are grouped into clusters. Several approaches to document similarity calculation have been proposed in existing systems, but these are either term based or pattern based, and they suffer from several problems. To address this challenging environment, the proposed system presents an innovative model for document similarity that applies a back propagation time stamp algorithm. It discovers patterns in text documents as higher-level features and creates a network for fast grouping. It also selects the most appropriate patterns based on their weights, and BPTT performs the document similarity measurement. Using this approach, documents can be categorized easily, and the problems of the training process are reduced. This framework, named BPTT, has been implemented and evaluated on the .NET platform with different datasets.
Summarization using ntc approach based on keyword extraction for discussion f...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Conceptual similarity measurement algorithm for domain specific ontologyZac Darcy
This paper presents a similarity measurement algorithm for domain-specific terms collected in an ontology-based data integration system. This similarity measurement algorithm can be used in ontology mapping and in the query service of the ontology-based data integration system. In this paper, we focus on the web query service to apply the proposed algorithm. Concept similarity is important for the web query service because the words in a user's input query are not wholly the same as the concepts in the ontology. So, we need to extract the possible concepts that match or relate to the input words with the help of the machine-readable dictionary WordNet. Sometimes, we use the generated mapping rules in the query generation procedure for words whose similarity cannot be confirmed by WordNet. We prove the effect of this algorithm with the two-degree semantic result of web mining by generating the concept results obtained from the input query.
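The abstract leans on WordNet to relate query words to ontology concepts. To stay self-contained, this sketch hard-codes a tiny hypernym hierarchy (a stand-in assumption; a real system would query WordNet itself) and computes one standard WordNet-style measure, Wu-Palmer similarity, defined as 2·depth(LCS) / (depth(a) + depth(b)):

```python
HYPERNYM = {            # child -> parent (toy is-a hierarchy, illustrative only)
    "car": "vehicle", "truck": "vehicle", "vehicle": "artifact",
    "dog": "animal", "animal": "entity", "artifact": "entity",
}

def path_to_root(word):
    """Chain of hypernyms from the word up to the hierarchy root."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path  # e.g. ["car", "vehicle", "artifact", "entity"]

def wu_palmer(a, b):
    """Wu-Palmer similarity: deeper shared ancestors mean higher scores."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)   # least common subsumer
    depth = lambda w: len(path_to_root(w))    # nodes from word up to root
    return 2 * depth(common) / (depth(a) + depth(b))

print(wu_palmer("car", "truck"))   # high: siblings under "vehicle"
print(wu_palmer("car", "dog"))     # low: only share the root
```

A query word whose score against an ontology concept exceeds some threshold would then be treated as a match or a related concept.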
Paper Explained: Understanding the wiring evolution in differentiable neural ...Devansh16
Read my Explanation of the Paper here: https://medium.com/@devanshverma425/why-and-how-is-neural-architecture-search-is-biased-778763d03f38?sk=e16a3e54d6c26090a6b28f7420d3f6f7
Abstract: Controversy exists on whether differentiable neural architecture search methods discover wiring topology effectively. To understand how wiring topology evolves, we study the underlying mechanism of several existing differentiable NAS frameworks. Our investigation is motivated by three observed searching patterns of differentiable NAS: 1) they search by growing instead of pruning; 2) wider networks are preferred over deeper ones; 3) no edges are selected in bi-level optimization. To anatomize these phenomena, we propose a unified view on searching algorithms of existing frameworks, transferring the global optimization to local cost minimization. Based on this reformulation, we conduct empirical and theoretical analyses, revealing implicit inductive biases in the cost's assignment mechanism and evolution dynamics that cause the observed phenomena. These biases indicate strong discrimination towards certain topologies. To this end, we pose questions that future differentiable methods for neural wiring discovery need to confront, hoping to evoke a discussion and rethinking on how much bias has been enforced implicitly in existing NAS methods.
Much previous research has proven that the use of rhetorical relations can enhance many applications such as text summarization, question answering and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploited the rhetorical relations existing between sentences to group similar sentences into multiple clusters and identify themes of common information, from which the candidate summary sentences were extracted. Then, cluster-based text summarization is performed using a Conditional Markov Random Walk Model to measure the saliency scores of the candidate summary. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, and the ROUGE scores of the generated summaries. The experimental results show that our method performed well, which indicates the promising potential of applying rhetorical relations in text clustering to benefit text summarization of multiple documents.
A Novel Approach for User Search Results Using Feedback SessionsIJMER
In the present scenario, user search results are obtained with the fuzzy c-means algorithm, which operates on queries submitted to search engines to represent the information needs of users. In the proposed approach, feedback sessions are clustered, and data are bound to each cluster by means of a membership function. Feedback sessions are constructed from user click-through logs and can efficiently reflect the information needs of users. Pseudo-documents are generated to better understand the clustered feedback. The fuzzy c-means clustering algorithm is used to cluster the feedback, and clustering the feedback can effectively reflect user needs. The fuzzy c-means algorithm uses the reciprocal of distances to decide the cluster centers. A ranking model is used to rank URLs based on user search feedback. Performance is evaluated using Classified Average Precision (CAP) for user search results.
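Since the abstract repeatedly invokes fuzzy c-means with membership functions and reciprocal distances, here is a plain-Python sketch of that algorithm; the spread-out initialization, fuzzifier m=2 and toy 2-D points are illustrative choices, not details from the paper:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def fuzzy_c_means(points, c=2, m=2.0, iters=50):
    """Plain fuzzy c-means: each point gets a membership in every cluster
    (the 'membership function' the abstract refers to), and centers are
    membership-weighted means."""
    centers = [points[i * len(points) // c] for i in range(c)]  # spread-out init
    u = []
    for _ in range(iters):
        # Memberships from reciprocal distance ratios to each center.
        u = []
        for p in points:
            d = [max(dist(p, ctr), 1e-12) for ctr in centers]
            u.append([1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
                      for i in range(c)])
        # Centers as u^m-weighted averages of the points.
        centers = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(len(points))]
            centers.append(tuple(
                sum(wk * pk[dim] for wk, pk in zip(w, points)) / sum(w)
                for dim in range(len(points[0]))))
    return centers, u

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
centers, u = fuzzy_c_means(points)
print(centers)
```

In the paper's setting the points would be feature vectors of feedback sessions rather than raw coordinates, and each session's membership row expresses how strongly it belongs to each cluster of information needs.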
Ontology Matching Based on hypernym, hyponym, holonym, and meronym Sets in Wo...dannyijwest
Considerable research in the field of ontology matching has been performed, as information sharing and reuse become necessary in ontology development. Measurement of lexical similarity in ontology matching is performed using synsets, defined in WordNet. In this paper, we define a Super Word Set, which is an aggregate set that includes the hypernym, hyponym, holonym, and meronym sets in WordNet. The Super Word Set Similarity is calculated as the rate of inclusion of a concept name's words and its synset's words in the Super Word Set. In order to measure the Super Word Set Similarity, we first extracted Matched Concepts (MC), Matched Properties (MP) and Property Unmatched Concepts (PUC) from the result of ontology matching. We compared these against two ontology matching tools, COMA++ and LOM. The Super Word Set Similarity shows an average improvement of 12% over COMA++ and 19% over LOM.
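The inclusion-rate idea can be sketched as follows. The hard-coded related-term sets below stand in for the hypernym/hyponym/holonym/meronym sets that WordNet would actually supply, and the symmetric averaging is an assumption for illustration, not the paper's exact formula:

```python
RELATED = {   # concept -> assumed WordNet-style related terms (toy data)
    "car": {"vehicle", "motorcar", "wheel", "engine", "convertible"},
    "automobile": {"vehicle", "motorcar", "auto", "machine", "wheel"},
}

def super_word_set(concept):
    """Aggregate set: the concept's own name plus all its related terms."""
    return {concept} | RELATED.get(concept, set())

def sws_similarity(a, b):
    """Rate of words shared between the two super word sets, averaged in
    both directions so the measure is symmetric."""
    sa, sb = super_word_set(a), super_word_set(b)
    overlap = len(sa & sb)
    return (overlap / len(sa) + overlap / len(sb)) / 2

print(round(sws_similarity("car", "automobile"), 3))
```

Because the aggregate sets pull in hypernyms and meronyms, two concepts with entirely different names can still overlap heavily, which is what lets this measure beat purely lexical synset comparison.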
Discovering Novel Information with sentence Level clustering From Multi-docu...irjes
Our specific objective is to discover novel information from a set of documents initially retrieved in response to some query. Clustering text at the sentence level, and its effective use and update, is still an open research issue, especially in the domain of text mining, since most existing systems assign each pattern to a single cluster. Here, instead, patterns can belong to all clusters with different degrees of membership. Since we cluster the sentences of those documents, we would expect at least one of the clusters to be closely related to the concepts described by the query terms. This paper presents a novel fuzzy clustering algorithm that operates on relational input data (i.e., data in the form of a square matrix of pairwise similarities between data objects).
An Entity-Driven Recursive Neural Network Model for Chinese Discourse Coheren...ijaia
Chinese discourse coherence modeling remains a challenging task in the Natural Language Processing field. Existing approaches mostly focus on feature engineering, adopting sophisticated features to capture the logical, syntactic or semantic relationships across sentences within a text. In this paper, we present an entity-driven recursive deep model for Chinese discourse coherence evaluation based on a current English discourse coherence neural network model. Specifically, to overcome the shortcoming of identifying entity (noun) overlap across sentences in the current model, our combined model successfully incorporates entity information into the recursive neural network framework. Evaluation results on both sentence ordering and machine translation coherence rating tasks show the effectiveness of the proposed model, which significantly outperforms an existing strong baseline.
ConNeKTion: A Tool for Exploiting Conceptual Graphs Automatically Learned fro...University of Bari (Italy)
Studying, understanding and exploiting the content of a digital library, and extracting useful information thereof, require automatic techniques that can effectively support the users. To this aim, a relevant role can be played by concept taxonomies. Unfortunately, the availability of such resources is limited, and their manual building and maintenance are costly and error-prone. This work presents ConNeKTion, a tool for conceptual graph learning and exploitation. It allows users to learn conceptual graphs from plain text and to enrich them by finding concept generalizations. The resulting graph can be used for several purposes: finding relationships between concepts (if any), filtering the concepts from a particular perspective, keyword extraction and information retrieval. A suitable control panel is provided for the user to comfortably carry out these activities.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a document collection. However, how to extract the hidden topics of the document collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from a scalability problem when the size of the document collection increases. In this paper, the Correlated Topic Model with a variational Expectation-Maximization algorithm is implemented in the MapReduce framework to solve the scalability problem. The proposed approach utilizes a dataset crawled from a public digital library. In addition, the full texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. Experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has comparable performance in terms of topic coherence with LDA implemented in the MapReduce framework.
ONTOLOGICAL MODEL FOR CHARACTER RECOGNITION BASED ON SPATIAL RELATIONSsipij
In this paper, we present a set of spatial relations between concepts, describing an ontological model for a new character recognition process. Our main idea is based on the construction of a domain ontology modelling the Latin script. This ontology is composed of a set of concepts and a set of relations. The concepts represent the graphemes extracted by segmenting the manipulated document, and the relations are of two types: is-a relations and spatial relations. In this paper, we are interested in the description of the second type of relation and its implementation in Java code.
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING mlaij
The volume of text resources has been increasing in digital libraries and on the internet, so organizing these text documents has become a practical need. Clustering techniques are used to organize a great number of objects into a small number of coherent groups automatically. These documents are widely used for information retrieval and Natural Language Processing tasks. Different clustering algorithms require a metric for quantifying how dissimilar two given documents are; this difference is often measured by a similarity measure such as Euclidean distance or cosine similarity. The similarity measure process in text mining can be used to identify the suitable clustering algorithm for a specific problem. This survey discusses the existing works on text similarity by partitioning them into three significant approaches: string-based, knowledge-based and corpus-based similarities.
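Cosine similarity, mentioned above as a standard document similarity measure, reduces to a few lines over term-frequency vectors (the example documents are illustrative):

```python
from collections import Counter
from math import sqrt

def cosine_sim(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between two term-frequency vectors: 1.0 for
    identical word distributions, 0.0 for documents sharing no terms."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_sim("clustering groups similar documents",
                 "clustering groups similar text documents"))
```

Unlike Euclidean distance, the cosine measure ignores document length, which is why it is the more common choice for comparing texts of very different sizes.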
Ekotex SCHONE LUCHT is a special glass-fibre wall covering for schools. It filters harmful substances from the air. It complies with BREEAM and the programme of requirements (PvE) for "Frisse Scholen" (fresh, healthy schools).
An outreach presentation on Smart Cities from a technological point of view, given in Zaragoza at the "Tech me out!" session of Pint of Science 2015.
In this paper we try to correlate text sequences that provide common topics as semantic clues. We propose a two-step method for asynchronous text mining. Step one checks for common topics in the sequences and isolates them with their timestamps. Step two takes a topic and tries to give the timestamp of the text document. After multiple repetitions of step two, we can give an optimal result.
Concurrent Inference of Topic Models and Distributed Vector Representations - Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Optimizer algorithms and convolutional neural networks for text classification - IAESIJAI
Lately, deep learning has improved the algorithms and architectures of several natural language processing (NLP) tasks. In spite of that, the performance of any deep learning model is widely impacted by the optimizer algorithm used, which updates the model parameters, finds the optimal weights, and minimizes the value of the loss function. Thus, this paper proposes a new convolutional neural network (CNN) architecture for text classification (TC) and sentiment analysis and uses it with various optimizer algorithms from the literature. Actually, in NLP, and particularly for sentiment classification concerns, the need for more empirical experiments increases the probability of selecting the pertinent optimizer. Hence, we have evaluated various optimizers on three types of text review datasets: small, medium, and large. Thereby, we examined the optimizers with regard to the data amount, and we implemented our CNN model on three different sentiment analysis datasets so as to assign binary labels to text reviews. The experimental results illustrate that the adaptive optimization algorithms Adam and root mean square propagation (RMSprop) surpassed the other optimizers. Moreover, our best CNN model, which employed the RMSprop optimizer, achieved 90.48% accuracy and surpassed the state-of-the-art CNN models for binary sentiment classification problems.
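As a rough, framework-free illustration of why the optimizer choice matters, the sketch below contrasts a plain SGD update with an RMSprop update on a one-dimensional toy objective; the hyperparameters are illustrative defaults, not the paper's tuned values:

```python
def sgd_step(w, grad, lr=0.01):
    # Vanilla gradient descent: fixed step in the negative gradient direction.
    return w - lr * grad

def rmsprop_step(w, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    # RMSprop: scale the step by a running average of squared gradients,
    # so the effective step size adapts to the gradient magnitude.
    state = rho * state + (1 - rho) * grad ** 2
    w = w - lr * grad / (state ** 0.5 + eps)
    return w, state

# Minimise f(w) = w^2 (gradient 2w) from the same starting point with each rule.
w_sgd = w_rms = 5.0
rms_state = 0.0
for _ in range(200):
    w_sgd = sgd_step(w_sgd, 2 * w_sgd)
    w_rms, rms_state = rmsprop_step(w_rms, 2 * w_rms, rms_state)
print(w_sgd, w_rms)  # both have moved toward the minimum at 0
```

In a real CNN the same update rules are applied per-parameter; adaptive methods like RMSprop and Adam keep one running statistic per weight.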
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANS - ijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However, it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t-SNE to develop a context-aware clustering approach (using pre-trained word embeddings). We encode words or categorical data into numerical, context-aware vectors that we use to cluster the data points using common clustering algorithms like K-means.
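A minimal sketch of this pipeline, with tiny hand-made 2-d vectors standing in for real pre-trained GloVe embeddings (which are 50- to 300-dimensional) and a bare-bones two-cluster K-means in place of a library call:

```python
# Hand-made 2-d stand-ins for pre-trained GloVe vectors: animal words point
# one way, colour words the other.
embeddings = {
    "cat": [0.90, 0.10], "dog":  [0.80, 0.20], "horse": [0.85, 0.15],
    "red": [0.10, 0.90], "blue": [0.20, 0.80], "green": [0.15, 0.85],
}

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_2(points, iters=10):
    # Deterministic init for this sketch: first and last point as centroids.
    centroids = [points[0], points[-1]]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            nearer = 0 if sq_dist(p, centroids[0]) <= sq_dist(p, centroids[1]) else 1
            clusters[nearer].append(p)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids

words = list(embeddings)
centroids = kmeans_2([embeddings[w] for w in words])
labels = {
    w: 0 if sq_dist(embeddings[w], centroids[0]) <= sq_dist(embeddings[w], centroids[1]) else 1
    for w in words
}
print(labels)  # animal words share one label, colour words the other
```

The point of the approach survives even in this toy: because the vectors carry context, semantically related categories end up in the same cluster without any numeric encoding chosen by hand.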
Comparative analysis of C99 and TopicTiling text segmentation algorithms - eSAT Journals
Abstract: In this paper, the work done includes the extraction of information from image datasets which contain natural text. The difficulty of segmenting natural text from an image is very high, so precision is the most important factor to keep in mind. To minimize error rates, an error filtration technique is provided, with filtration applied during image segmentation, specifically for the text present in images. Furthermore, a comparative analysis of two different text segmentation algorithms, namely C99 and TopicTiling, on image documents is presented. To assess how well each algorithm works, each was applied on different datasets and the results were compared. The work done also proves the efficiency of TopicTiling over C99. Index Terms: Text segmentation, text extraction, image documents, C99, TopicTiling.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
EXPLOITING RHETORICAL RELATIONS TO MULTIPLE DOCUMENTS TEXT SUMMARIZATION - IJNSA Journal
Much previous research has proven that the use of rhetorical relations can enhance many applications such as text summarization, question answering, and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in cluster-based text summarization of multiple documents. We exploited the rhetorical relations that exist between sentences to group similar sentences into multiple clusters and identify themes of common information. Candidate summaries were extracted from these clusters. Then, cluster-based text summarization was performed using the Conditional Markov Random Walk Model to measure the saliency scores of the candidate summaries. We evaluated our method by measuring the cohesion and separation of the clusters constructed by exploiting rhetorical relations, and the ROUGE score of the generated summaries. The experimental results show that our method performed well, which shows the promising potential of applying rhetorical relations in text clustering to benefit text summarization of multiple documents.
The increased potential of ontologies to reduce human interference has a wide range of applications. This paper identifies requirements for an ontology development platform to innovate the artificially intelligent web. To facilitate this process, RDF and OWL have been developed as standard formats for the sharing and integration of data and knowledge, where knowledge takes the form of rich conceptual schemas called ontologies. Based on this framework, an architectural paradigm is put forward in view of ontology engineering and the development of ontology applications, and a development portal is designed to support ontology engineering, content authoring, and application development, with a view to maximal scalability in the size and complexity of semantic knowledge and flexible reuse of ontology models and ontology application processes in a distributed and collaborative engineering environment.
HYPONYMY EXTRACTION OF DOMAIN ONTOLOGY CONCEPT BASED ON CCRFS AND HIERARCHY C... - dannyijwest
Concept hierarchy is the backbone of ontology, and concept hierarchy acquisition has been a hot topic in the field of ontology learning. This paper proposes a hyponymy extraction method for domain ontology concepts based on cascaded conditional random fields (CCRFs) and hierarchical clustering. It takes free text as the extraction object and adopts CCRFs to identify the domain concepts. First, the low layer of the CCRFs is used to identify simple domain concepts; then the results are sent to the high layer, in which the nested concepts are recognized. Next, we adopt hierarchical clustering to identify the hyponymy relations between domain ontology concepts. The experimental results demonstrate that the proposed method is efficient.
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS - IJDKP
Big Data creates many challenges for data mining experts, in particular in extracting meaning from text data. It is beneficial for text mining to build a bridge between the word embedding process and graph capacity to connect the dots and represent complex correlations between entities. In this study we examine the process of building a semantic graph model to determine word associations and discover document topics. We introduce a novel Word2Vec2Graph model that is built on top of the Word2Vec word embedding model. We demonstrate how this model can be used to analyze long documents, find unexpected word associations, and uncover document topics. To validate the topic discovery method, we transfer words to vectors and vectors to images, and use CNN deep learning image classification.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATION - IJDKP
This article introduces some approaches for improving text categorization models by integrating previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very similar in content and related to the telecommunications, Internet, and computer areas were selected for model experiments. Several domain ontologies covering these areas were built and integrated into the categorization models to improve them.
Much previous research has proven that the use of rhetorical relations can enhance many applications such as text summarization, question answering, and natural language generation. This work proposes an approach that extends the benefit of rhetorical relations to address the redundancy problem in text summarization. We first examined and redefined the types of rhetorical relations that are useful for retrieving sentences with identical content, and performed the identification of those relations using SVMs. By exploiting the rhetorical relations that exist between sentences, we generate clusters of similar sentences from document sets. Then, cluster-based text summarization is performed using the Conditional Markov Random Walk Model to measure the saliency scores of candidate summaries. We evaluated our method by measuring the cohesion and separation of the clusters and the ROUGE score of the generated summaries. The experimental results show that our method performed well, which shows the promising potential of applying rhetorical relations in cluster-based text summarization.
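As a rough sketch of random-walk saliency scoring in the spirit of the model named above (this is a plain, unconditioned walk over sentence-similarity edges; the Conditional Markov Random Walk Model additionally conditions the walk on cluster information):

```python
def saliency_scores(sim, d=0.85, iters=50):
    """PageRank-style saliency over a sentence-similarity matrix.

    Each sentence distributes its score to the others in proportion to
    edge similarity; d is the damping factor.
    """
    n = len(sim)
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j == i:
                    continue
                total = sum(sim[j][k] for k in range(n) if k != j)
                if total > 0:
                    rank += sim[j][i] / total * scores[j]
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

# Symmetric similarity matrix for three candidate sentences.
sim = [[0.0, 0.8, 0.1],
       [0.8, 0.0, 0.3],
       [0.1, 0.3, 0.0]]
scores = saliency_scores(sim)
print(scores)  # sentence 1 is most connected, so it gets the highest saliency
```

A summary is then assembled by taking the top-scoring sentences, subject to the redundancy constraints the clustering provides.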
Effect of word embedding vector dimensionality on sentiment analysis through ... - IAESIJAI
Word embedding has become the most popular method of lexical description in a given context in the natural language processing domain, especially through the word to vector (Word2Vec) and global vectors (GloVe) implementations. Since GloVe is a pre-trained model that provides access to word mapping vectors on many dimensionalities, a large number of applications rely on its prowess, especially in the field of sentiment analysis. However, in the literature, we found that in many cases, GloVe is implemented with arbitrary dimensionalities (often 300d) regardless of the length of the text to be analyzed. In this work, we conducted a study that identifies the effect of the dimensionality of word embedding mapping vectors on short and long texts in a sentiment analysis context. The results suggest that as the dimensionality of the vectors increases, the performance metrics of the model also increase for long texts. In contrast, for short texts, we recorded a threshold at which dimensionality does not matter.
SEARCH OF INFORMATION BASED CONTENT IN SEMI-STRUCTURED DOCUMENTS USING INTERF... - ijcsitcejournal
This paper proposes a semi-structured information retrieval model based on a new method for the calculation of similarity. We have developed the CASISS (Calculation of Similarity of Semi-Structured documents) method to quantify how similar two given texts are. This new method identifies elements of semi-structured documents using element descriptors. Each semi-structured document is pre-processed before the extraction of a set of descriptors for each element, which characterize the contents of the elements. It can be used to increase the accuracy of the information retrieval process by taking into account not only the presence of query terms in the given document but also the topology (position continuity) of these terms.
Entity Annotation WordPress Plugin using TAGME Technology - TELKOMNIKA JOURNAL
The development of internet technology makes more information accessible, so information needs to be organized in order to be easily managed. One solution is the entity annotation approach, which generates tags to represent a document. In this study, TAGME technology is implemented in a WordPress plugin, which is used to manage a blog. Moreover, information from Wikipedia 'Bahasa Indonesia' is processed to generate the anchor dictionary required by the implemented technology. This plugin performs entity annotation by giving tag suggestions for posts in a blog. Testing is carried out by measuring the precision and recall of the tag suggestions given by the plugin. The results show that the plugin can give tag suggestions with a precision of 0.7638, a recall of 0.5508, and an overall score of 0.59.
Generative AI Deep Dive: Advancing from Proof of Concept to Production - Aggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
GridMate - End to end testing is a critical piece to ensure quality and avoid... - ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
How to Get CNIC Information System with Paksim Ga.pptx - danishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs - Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 - Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf - Paige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, where I will share these foundational concepts to build on:
Secstrike : Reverse Engineering & Pwnable tools for CTF.pptx
Jing Ma et al., Int. Journal of Engineering Research and Applications, ISSN: 2248-9622, Vol. 4, Issue 6 (Version 2), June 2014, pp. 13-19. www.ijera.com
The Topic Tracking Based on Modified VSM of Lexical Chain's Sememe
Jing Ma*, Fei Wu*, Chi Li**, Hengmin Zhu***
*(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing, China)
**(College of Mathematics, University of Science and Technology of China, Hefei, China)
***(College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing, China)
ABSTRACT
The Vector Space Model (VSM) has aroused significant research attention in recent years due to its advantages in topic tracking. However, its effectiveness has been restrained by its incapability in revealing same-concept semantic information of different keywords or hidden semantic relations of the text, making the accuracy of topic tracking hardly guaranteed. Confronting these issues, a modified VSM, namely the Semantic Vector Space Model, is put forward. To establish the model, numerous lexical chains based on HowNet are first built, then sememes of the lexical chains are extracted as characteristics of the feature vectors. Afterwards, the initial weight and structural weight of the characteristics are calculated to construct the Semantic Vector Space Model, encompassing both semantic and structural information. The initial weight is collected from word frequency, while the structural weight is obtained from a designed calculation method: each lexical chain's structural weight is defined as (m + 1)/S, where m is the number of other similar chains, and S is the number of reports used for extraction of the lexical chains. Finally, the model is applied to web news topic tracking with satisfactory experimental results, confirming the method to be effective and desirable.
Keywords - Topic tracking, Vector Space Model, Lexical chain, Sememe
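The structural-weight rule quoted in the abstract, (m + 1)/S, can be sketched numerically; the chain names, counts, and the frequency-times-structure combination below are illustrative assumptions, not the paper's data:

```python
S = 20  # number of reports used to extract the lexical chains (assumed)

# Assumed chains: m = number of other similar chains,
# freq = initial weight collected from word frequency.
chains = {
    "economy": {"m": 5, "freq": 12},
    "market":  {"m": 3, "freq": 7},
    "policy":  {"m": 0, "freq": 4},
}

def structure_weight(m, S):
    # Structural weight of a lexical chain as defined in the paper: (m + 1) / S.
    return (m + 1) / S

for name, c in chains.items():
    w = structure_weight(c["m"], S)
    # Scaling the frequency weight by the structural weight is an illustrative
    # combination, not the paper's exact formula.
    print(name, w, c["freq"] * w)
```

A chain echoed by more similar chains across the report set (larger m) thus contributes a proportionally larger characteristic weight to the feature vector.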
1. INTRODUCTION
Topic tracking is a method that mainly works to
get the topic model on the basis of training corpus
and then track the follow-up reports related to the
topic. It gathers isolated information scattered in
different time and places to demonstrate full details
of events and relationships between them [1]
. Since
documents written by natural language can hardly be
comprehended by computer, a mathematical
representation of the document model is required to
be defined to realize document processing by
computer. Along with this methodological thinking,
several approaches are presented , such as Boolean
Model, Vector Space Model and probability Model
the conceptual Model, etc., among which Vector
Space Model (shorten as VSM, proposed by G.
Salton, A. Wong, and C. S. Yang in the late 1960s)
appears to be most popular and successfully applied
in the famous SMART system. After that, the model
and its related technologies, including selection of
items, weight strategy and queuing optimization, had
been widely used [2]
in text classification, automatic
index, information retrieval and many other fields,
making it the mainstream model in topic tracking.
One of VSM's advantages lies in its knowledge representation: a document is transformed into a vector in a feature space, so operations on documents are converted into mathematical operations on vectors, reducing the complexity of the problem. The semantic information of the text, however, is ignored by this method, which means accuracy cannot be guaranteed. A natural remedy is to use external semantic knowledge to improve the Vector Space Model. For example, starting from a mechanism analysis of user modeling based on a semantic hierarchy tree, Hu Jiming et al. [3] used domain ontology to accomplish resource description and user modeling, thereby building a semantic vector space model. This effort added semantic information to VSM, but since the theory and technology of ontology research are not yet mature [4], it did not solve the problem thoroughly. Jin Zhu [5] made full use of an external semantic resource, HowNet, to realize effective topic tracking and subject-position classification on the basis of information retrieval technology. Although the semantic meaning of the text was considered, its structural information was neglected.
The lexical chain, first put forward by Halliday and Hasan [6] in 1976, is an external manifestation of the continuity of semantic relations between words. It corresponds closely to the structure of the text, providing important clues about its structure and theme [7].
RESEARCH ARTICLE OPEN ACCESS
Jing Ma et al, Int. Journal of Engineering Research and Applications, www.ijera.com
ISSN: 2248-9622, Vol. 4, Issue 6 (Version 2), June 2014, pp. 13-19

From what has been discussed above, this paper introduces HowNet and lexical chains into the model-building process: lexical chains are constructed based on HowNet, and a semantic vector space model of the topic is then built from the sememes of the lexical chains, capturing both the semantic information and the structural information of the text. Finally, when applied to Sina Weibo topic tracking, the experiment proves that the method is effective.
[2] BUILDING THE VECTOR SPACE MODEL BASED ON THE LEXICAL CHAIN'S SEMEME
2.1 The extraction of the lexical chain based on
HowNet
HowNet is a common-sense knowledge base that describes the concepts represented by Chinese and English words, revealing the relationships between concepts and the attributes of concepts [8]. Morris and Hirst [9] first introduced the concept of the lexical chain, which is constructed by splitting the text to obtain information about its structure. The lexical chain constructed in this paper is based on semantic similarity, and it contains both the semantic information and the structural information of the text. The lexical chain building steps are as follows:
(1) Use the ICTCLAS segmentation tool developed by the Chinese Academy of Sciences to segment the text automatically and construct the word set.
(2) Take the first word from the set to seed the initial lexical chain. Then take each candidate word in turn and compute its similarity to the chain; insert the word into the current lexical chain if it meets the threshold requirement, otherwise skip it.
(3) Output the current lexical chain and delete its words from the vocabulary. If the word set is empty, the process is finished; if not, return to step (2).
(4) Repeat until the word set is empty.
The specific process is shown in Fig. 2.1.

Fig. 2.1 The extraction of lexical chain

The lexical chain building pseudocode is as follows:

k = 1                              // k is the current chain's serial number
L[k] = {}                          // initialize the current lexical chain
Word = (W1, W2, W3, ..., Wn)       // word set produced by segmentation

BuildLexicalChain(k):
    add Word[0] to L[k]            // the first word seeds the chain
    for i = 1 to length(Word) - 1:
        if Similarity(Word[0], Word[i]) > 0.5:   // threshold requirement
            add Word[i] to L[k]    // insert the word into the current chain
    output L[k]                    // output the current lexical chain
    remove the words of L[k] from Word
    if Word is empty:
        stop                       // every word has been assigned to a chain
    else:
        k = k + 1
        BuildLexicalChain(k)       // recurse on the remaining words
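The procedure above can be sketched as runnable Python. The word-similarity function is an assumption here: the paper's actual measure is HowNet-based, so `toy_sim` below is only an illustrative stand-in returning values in [0, 1].

```python
def build_lexical_chains(words, similarity, threshold=0.5):
    """Greedily partition `words` into lexical chains.

    Each chain is seeded by the first remaining word; any later word
    whose similarity to the seed exceeds `threshold` joins the chain,
    and the chain's words are then removed from the word set.
    """
    chains = []
    remaining = list(words)
    while remaining:
        seed = remaining[0]
        chain = [seed] + [w for w in remaining[1:]
                          if similarity(seed, w) > threshold]
        chains.append(chain)
        # delete the chain's words from the vocabulary, as in step (3)
        remaining = [w for w in remaining if w not in chain]
    return chains

# Toy similarity: 1.0 when two words share a first letter, else 0.0
toy_sim = lambda a, b: 1.0 if a[0] == b[0] else 0.0
print(build_lexical_chains(["flu", "fever", "drug", "dose"], toy_sim))
# -> [['flu', 'fever'], ['drug', 'dose']]
```

The loop terminates because each pass removes at least the seed word from the remaining set.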
2.2 Building vector space model based on the
lexical chain’s sememe.
Since this paper constructs lexical chains based on the semantic similarity of words, the semantic information of the words within each chain is similar. Based on this, the paper extracts a representative sememe from each lexical chain to serve as a feature of the feature vector. Word frequency is used as the initial weight of each feature, and a structure weight is obtained from the calculation method designed below. Finally, the structure weight is used to adjust the initial weight of each feature, yielding the semantic vector space model of the topic: T = (L1, LW1, L2, LW2, L3, LW3, ..., Ln, LWn), where Ln represents the sememe of the nth chain and LWn represents its weight. Below is the specific process.
Fig. 2.2 The construction of the sememe vector space model (construct lexical chains based on HowNet; extract the sememe of each chain; calculate the structure weight of each chain; build the semantic feature vector)
In this way, the vector not only reduces the dimension of the vector space but also includes the semantic and the structural information of the text.
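A minimal sketch of this construction, under two stated assumptions: each chain has already been mapped to a representative sememe (here `sememe_of` simply picks the chain's first word, an illustrative stand-in for the HowNet lookup), and the initial weight of a sememe is the summed frequency of the chain's words in the text.

```python
from collections import Counter

def sememe_vector(chains, sememe_of, tokens):
    """Build (sememe, weight) pairs: one feature per lexical chain.

    `sememe_of` maps a chain to its representative sememe; the initial
    weight is the total frequency of the chain's words in `tokens`.
    """
    freq = Counter(tokens)
    return [(sememe_of(chain), sum(freq[w] for w in chain))
            for chain in chains]

chains = [["flu", "fever"], ["drug", "dose"]]
tokens = ["flu", "flu", "fever", "drug", "dose", "dose"]
vec = sememe_vector(chains, sememe_of=lambda c: c[0], tokens=tokens)
print(vec)  # -> [('flu', 3), ('drug', 3)]
```

Because there is one feature per chain rather than one per word, the dimensionality drops from the vocabulary size to the number of chains, which is the reduction the text describes.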
[3] THE DESIGN OF THE TOPIC TRACKING ALGORITHM
Since our chosen corpus is for a specific topic, we adopt the TF (term frequency statistics) method to obtain the initial feature weights, and the lexical chains extracted from the whole training corpus reveal the structural characteristics of the topic. Based on this, the topic tracking algorithm is designed as follows:
(1) Extract the lexical chains and their sememes after performing word segmentation, part-of-speech tagging, and duplicate-word removal on the topic training samples. Then use the sememes as features of the VSM to constitute a semantic vector. The initial weight of each sememe is the sum of the weights of all the key words in its chain. The initial space vector of the topic is: T = (TW1, TW2, ..., TWn).
(2) Use the sememes to calculate the similarity between lexical chains. Set a threshold value and define two lexical chains to be similar when their degree of similarity is greater than the threshold. Count the number of other chains similar to the current one and denote it m.
(3) The structure weight of each lexical chain is defined as TW = (m + 1)/S, where m is the number of other chains similar to the current chain and S is the number of reports used for the extraction of lexical chains. The final weight of each feature of the topic is the product of the initial weight and the structure weight of the lexical chain carrying that feature, defined as tw = Tw * (m + 1)/S; thus the final vector of the topic is: T = (tw1, tw2, ..., twn).
(4) Use the same method to process the subsequent reports; the vector of a report is then: d = (dw1, dw2, ..., dwn).
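The weight adjustment in step (3) can be sketched as follows; the function names are illustrative, and the inputs are assumed to come from steps (1) and (2).

```python
def adjust_weights(initial_weights, similar_counts, num_reports):
    """Apply the structure weight (m + 1) / S to each feature.

    initial_weights: Tw for each chain's sememe, from step (1).
    similar_counts:  m for each chain (number of similar chains), step (2).
    num_reports:     S, the number of reports used for chain extraction.
    Returns the final weights tw = Tw * (m + 1) / S.
    """
    return [tw * (m + 1) / num_reports
            for tw, m in zip(initial_weights, similar_counts)]

print(adjust_weights([5.0, 2.0], similar_counts=[3, 0], num_reports=4))
# -> [5.0, 0.5]
```

A chain similar to many others (large m) keeps more of its initial weight, reflecting that it carries structure shared across the training reports.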
The paper takes the cosine formula of vectors to compute the similarity between the topic and the follow-up reports. The formula is as follows:

sim(T, d) = cos<T, d> = ( Σi twi × dwi ) / ( √(Σi twi²) × √(Σj dwj²) )
T denotes the topic; d denotes a later report; twi represents the weight of the ith feature of the topic; dwj represents the weight of the jth feature of the subsequent report.
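The cosine formula above is standard and can be implemented directly; the vectors are assumed to share the same feature order, with 0 for features absent from a report.

```python
import math

def cosine_similarity(t, d):
    """sim(T, d): dot product of t and d over the product of their norms."""
    dot = sum(tw * dw for tw, dw in zip(t, d))
    norm_t = math.sqrt(sum(tw * tw for tw in t))
    norm_d = math.sqrt(sum(dw * dw for dw in d))
    return dot / (norm_t * norm_d)

print(cosine_similarity([1.0, 2.0, 2.0], [1.0, 2.0, 2.0]))  # -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # -> 0.0
```

Identical vectors score 1, vectors with no shared features score 0, so a similarity threshold in between (the paper uses 0.5 and 0.6 later) separates related from unrelated reports.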
For each subsequent report, use the similarity model described above to compute sim(T, d) between the topic and the report; when the similarity is greater than the threshold, they are defined as similar. The specific process is shown in Fig. 3.1:
Fig. 3.1 The algorithm of topic tracking (preprocess the topic samples; extract lexical chains based on HowNet; calculate the similarity between chains; determine each chain's structure weight from the number of similar chains; build the semantic feature vector; then, for the text of each report, build its semantic feature vector, calculate the similarity, and store and render the report if it meets the threshold requirement, otherwise skip it)
[4] EXPERIMENTS AND RESULTS
This article selects three topics, the H7N9 bird flu treatment, Syrian refugees, and wasp stings, to conduct the experiments. Based on the operations above, the original sememe feature vectors of the three topics are obtained as follows:
A. The treatment of H7N9: (H7N9, bird flu, adjust, cure, published, drug, places, eliminate, show, property, people, monitor, agency, disease, know)
B. Wasp stings: (place, dead, cure, worm, people, organization, against, time, damage, parts, tell, bad thing, using, eliminate, check, understand, form, work on)
C. Syria's refugees: (represent, countries, struggle, realize, agency, people, rescue, phenomenon (difficult), avoid, enter, appear, records, situation, increase)
After the weight calculation, the vectors are as follows:
TH7N9=(5.1, 4.4, 0.2, 8.1, 0.8, 4.1, 0.7, 0.7, 0.3,
0.3, 3.6, 1.5, 0.4, 0.9, 0.3, 0.2)
T wasp stings = (3.2, 1,3,17.2, 0.6, 3.2, 0.6, 2, 0.6,
1.2, 1.4, 1, 0.8, 0.4, 0.6, 0.4, 0.4, 0.4, 0.4)
T Syria= (1.6, 15.4, 1.4, 1.6, 15.4, 13.2, 1, 3, 2.2,
0.4, 0.4, 0.4, 0.8, 0.6)
The paper then selects, with the help of domain experts, 5 similar reports for each topic to calculate similarity. Take the H7N9 topic as an example. After processing, the feature vector space of one of the five reports is: (H7N9, bird flu, 0, heal, published, drugs, 0, 0, 0, 0, 0, 0, 0, 0).
After calculation, the weighted feature vector of the report is: t = (0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 3.6, 2, 1.1, 3.6, 0).
According to the cosine formula of the vector space model, the similarity is: 63.8 / √(13.35 × 94.145) = 89%.
To verify the effectiveness of the method, this paper uses the traditional vector space model in a comparison experiment, again taking the H7N9 topic as an example. The feature vector space constructed by word frequency statistics is: (method 53, H7N9 51, bird flu 44, injection 39, people 32,
infection 32, diagnosis and treatment
30,pharmaceutical 24, cases 19, company 18, country
18, varieties 17, flu 15, detection 15, control 15,
Chinese medicine 15, prevent 13, virus 13,
prevention and cure 12, health 11, kang yuan 11, lab
10, income 9, recommended 9, committee 8, products
8, published 8, diagnosis 7, medicine 7, Capsule 7,
patients 6, traditional Chinese medicine 6, drugs 6,
Selected 6, sales 6, program 6, control 6, think 5, use
5, the ministry of health 5, expert 4, Business
4,hospital 4, detoxification 4, contact 4, printing 4,
state of an illness 4, samples 4, children 4, agency 4,
Chinese patent drugs 4.
Then the vector is: T = (53, 51, 44, 39, 32, 32, 30, 26, 24, 19, 18, 18, 17, 15, 15, 15, 15, 13, 13, 12, 11, 11, 10, 9, 9, 8, 8, 8, 7, 7, 7, 7, 6, 6, 6, 6, 6, 6, 6, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
Then we use the word frequency method to construct the vector for the report used in the last experiment: t = (3, 2, 2, 6, 0, 1, 3, 1, 1, 0, 5, 4, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 1, 1, 3, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
The similarity is: 1052 / √(131 × 16118) = 72.4%, obviously lower than the similarity calculated from the lexical chain's sememe space vector.
The similarity of the three topics is shown in Table 4.1:
Table 4.1 Details of the similarity
Plotting the data points in a coordinate system and connecting them with straight lines, we get Fig. 4.1:
Fig. 4.1 Contrast of the similarity
The serial numbers of the reports are on the horizontal axis and the similarity is on the vertical axis. The figure shows that the similarity of the new method is higher.
To further verify the superiority of the algorithm, this paper designed a topic tracking experiment system. The system mainly includes the following three parts: preprocessing of network reports, solution selection, and topic tracking. The solution selection module chooses between constructing the lexical chain's sememe vector and word frequency statistics. The details are in Fig. 4.2:
Fig. 4.2 Topic tracking experiment system (preprocessing of the reports: word segmentation, part-of-speech tagging, stop-word removal; solution selection: either build the semantic feature vector from lexical chains, i.e. extract lexical chains based on HowNet, extract the sememe of each chain and establish its initial weight, calculate the chain's structure weight, and build the semantic feature vector, or build the feature vector from word frequency statistics; topic tracking: build the topic model, build the report model, calculate similarity, and contrast the tracking results)
TDT established a complete evaluation system, which uses the miss rate PMiss, the false alarm rate PFA, and the normalized loss cost (CDet)Norm to indicate the performance of the system.
The paper downloaded 269, 250, and 232 news reports for the three topics respectively; the details are in Table 4.2. Threshold values of 0.5 and 0.6 are set for the two topic tracking runs, and H7N9 is taken as the example to show the process.
Table 4.2 Details of the corpus
After importing the data into the database, the initial database is shown in Fig. 4.3.
After tracking with the word frequency statistics method, the result is shown in Fig. 4.4: there is a total of 43 records, including 39 related to the topic and 4 unrelated.
After tracking with the lexical chain's sememe method, the result is shown in Fig. 4.5: there is a total of 55 records, including 46 related to the topic and 9 unrelated.
The details of the three topics' tracking results are in Table 4.3:
In this evaluation system, the miss rate PMiss and the false alarm rate PFA are used to calculate the detection cost CDet, which is then normalized to the loss cost (CDet)Norm, the evaluation index of a topic tracking system. A smaller value of (CDet)Norm indicates better system performance. The formulas are as follows:
CDet = CMiss × PMiss × Ptarget + CFA × PFA × Pnon-target

(CDet)Norm = CDet / min(CMiss × Ptarget, CFA × Pnon-target)

where CMiss = 1, CFA = 0.1, Ptarget = 0.02, and Pnon-target = 1 − Ptarget.
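The two cost formulas can be checked with a short sketch; the parameter values are the TDT constants quoted above, and the example miss/false-alarm rates are arbitrary illustrative inputs.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """TDT detection cost C_Det and its normalized form (C_Det)_Norm."""
    p_non_target = 1.0 - p_target
    c_det = c_miss * p_miss * p_target + c_fa * p_fa * p_non_target
    c_norm = c_det / min(c_miss * p_target, c_fa * p_non_target)
    return c_det, c_norm

# Example: 10% misses and 5% false alarms
c_det, c_norm = detection_cost(p_miss=0.1, p_fa=0.05)
print(round(c_det, 4), round(c_norm, 3))  # -> 0.0069 0.345
```

With these constants the denominator is min(0.02, 0.098) = 0.02, so the normalization divides by the cost of a system that flags nothing.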
PMiss and PFA should both be as small as possible. Their formulas are as follows:

PMiss = (number of related reports the system fails to recognize) / (total number of related reports in the corpora) × 100%

PFA = (number of reports the system misjudges as related to the topic) / (total number of unrelated reports in the corpora) × 100%
The comparison and analysis of the results are in Tables 4.4 and 4.5:
Fig. 4.3 The initial database
Fig. 4.4 The result of tracking with the word frequency statistics method
Fig. 4.5 The result of tracking with the lexical chain's sememe method
Table 4.3 The details of the three topics' tracking results (t=0.5)
From Tables 4.4 and 4.5, one can see that the miss rate of the lexical chain's sememe method is lower than that of the word frequency statistics method, although its false alarm rate is higher. Above all, the loss cost of the approach based on the lexical chain's sememe is lower than that based on word frequency statistics, proving that the topic tracking algorithm based on the lexical chain's sememe is effective.
Table 4.4 Comparison and analysis of the results (t=0.5)
When the threshold is 0.6, the comparison and analysis of the results are in Table 4.5:
Table 4.5 Comparison and analysis of the results (t=0.6)
[5] CONCLUSIONS
This paper extracts lexical chains based on the external semantic resource HowNet, then takes the sememes of the chains as features to build the original feature vector. The weight of each feature is determined by word frequency statistics combined with the structure weight of the lexical chain, so the semantic information and the structural information of the text are both fully considered. In the topic tracking experiment system, the loss cost of the improved model is smaller, improving the efficiency of topic tracking.
References
[1] James Allan, Introduction to topic detection and tracking, in James Allan (Ed.), Topic Detection and Tracking: Event-based Information Organization (USA: Kluwer Academic Publishers, 2002).
[2] Yiming Yang, J. Carbonell, R. Brown, et al., Learning approaches for detecting and tracking news events, IEEE Intelligent Systems: Special Issue on Applications of Intelligent Information Retrieval, 14(4), 1999, 32-43.
[3] Hu Jiming, Hu Changping, The user modeling based on topic hierarchy tree and semantic vector space model, Journal of Intelligence, 32(8), 2013, 838-843.
[4] G. Beydoun, A. A. Lopez-Lorca, et al., How do we measure and improve the quality of a hierarchical ontology, Journal of Systems and Software, 84(12), 2011, 2363-2373.
[5] Jin Zhu, Lin Hongfei, Topic tracking and tendency analysis based on HowNet, Journal of Intelligence, 24(5), 2005, 555-561.
[6] M. A. K. Halliday, R. Hasan, Cohesion in English (London, UK: Longman, 1976).
[7] E. Gonenc, C. Ilyas, Using lexical chains for keyword extraction, Information Processing and Management, 43(6), 2007, 1705-1714.
[8] HowNet's Home Page, http://www.keenage.com.
[9] J. Morris, G. Hirst, Lexical cohesion computed by thesaural relations as an indicator of the structure of text, Computational Linguistics, 17(1), 1991, 21-48.