The document describes a project to semantically annotate research papers with ACM classification categories. It discusses using cosine similarity, latent Dirichlet allocation, and a proposed model combining labeled LDA and doc2vec. The proposed model trains a supervised topic model to learn document representations that capture semantic relationships between papers and categories. The model achieved 59.31% mean average precision and 45.03% NDCG on a test dataset, demonstrating an improvement over baselines.
Semantic annotation is done through first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations, the vectors are taken as features into a classifier, trained and a model is made which can classify a document with ACM classification tree categories, with the help of Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le, and Tomas Mikolov, ''Distributed Representations of Sentences and Documents ICML", 2014
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Spam filtering poses a critical problem in
text categorization as the features of text is
continuously changing. Spam evolves continuously and
makes it difficult for the filter to classify the evolving
and evading new feature patterns. Most practical
applications are based on online user feedback, the
task calls for fast, incremental and robust learning
algorithms. This paper presents a system for
automatically detection and filtering of unsolicited
electronic messages. In this paper, we have developed
a content-based classifier, which uses two topic models
LSI and PLSA complemented with a text patternmatching
based natural language approach. By
combining these powerful statistical and NLP
techniques we obtained a parallel content based Spam
filter, which performs the filtration in two stages. In
the first stage each model generates its individual
predictions, which are combined by a voting
mechanism as the second stage.
Comparative study of classification algorithm for text based categorizationeSAT Journals
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine
learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization techniques.
It provides conceptual views of the collected documents and has important applications in the real world. Text based
categorization is made use of for document classification with pattern recognition and machine learning. Advantages of a number
of classification algorithms have been studied in this paper to classify documents. An example of these algorithms is: Naive Bayes'
algorithm, K-Nearest Neighbor, Decision Tree etc. This paper presents a comparative study of advantages and disadvantages of
the above mentioned classification algorithm.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
Semantic annotation is done through first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations, the vectors are taken as features into a classifier, trained and a model is made which can classify a document with ACM classification tree categories, with the help of Wikipedia corpus.
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le, and Tomas Mikolov, ''Distributed Representations of Sentences and Documents ICML", 2014
Slides: Concurrent Inference of Topic Models and Distributed Vector Represent...Parang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Concurrent Inference of Topic Models and Distributed Vector RepresentationsParang Saraf
Abstract: Topic modeling techniques have been widely used to uncover dominant themes hidden inside an unstructured document collection. Though these techniques first originated in the probabilistic analysis of word distributions, many deep learning approaches have been adopted recently. In this paper, we propose a novel neural network based architecture that produces distributed representation of topics to capture topical themes in a dataset. Unlike many state-of-the-art techniques for generating distributed representation of words and documents that directly use neighboring words for training, we leverage the outcome of a sophisticated deep neural network to estimate the topic labels of each document. The networks, for topic modeling and generation of distributed representations, are trained concurrently in a cascaded style with better runtime without sacrificing the quality of the topics. Empirical studies reported in the paper show that the distributed representations of topics represent intuitive themes using smaller dimensions than conventional topic modeling approaches.
For more information, please visit: http://people.cs.vt.edu/parang/ or contact parang at firstname at cs vt edu
Spam filtering poses a critical problem in
text categorization as the features of text is
continuously changing. Spam evolves continuously and
makes it difficult for the filter to classify the evolving
and evading new feature patterns. Most practical
applications are based on online user feedback, the
task calls for fast, incremental and robust learning
algorithms. This paper presents a system for
automatically detection and filtering of unsolicited
electronic messages. In this paper, we have developed
a content-based classifier, which uses two topic models
LSI and PLSA complemented with a text patternmatching
based natural language approach. By
combining these powerful statistical and NLP
techniques we obtained a parallel content based Spam
filter, which performs the filtration in two stages. In
the first stage each model generates its individual
predictions, which are combined by a voting
mechanism as the second stage.
Comparative study of classification algorithm for text based categorizationeSAT Journals
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine
learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization techniques.
It provides conceptual views of the collected documents and has important applications in the real world. Text based
categorization is made use of for document classification with pattern recognition and machine learning. Advantages of a number
of classification algorithms have been studied in this paper to classify documents. An example of these algorithms is: Naive Bayes'
algorithm, K-Nearest Neighbor, Decision Tree etc. This paper presents a comparative study of advantages and disadvantages of
the above mentioned classification algorithm.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition the data values with similarity. Similarity
measures are used to estimate transaction relationships. Hierarchical clustering model produces tree
structured results. Partitioned clustering produces results in grid format. Text documents are
unstructured data values with high dimensional attributes. Document clustering group ups unlabeled text
documents into meaningful clusters. Traditional clustering methods require cluster count (K) for the
document grouping process. Clustering accuracy degrades drastically with reference to the unsuitable
cluster count.
Textual data elements are divided into two types’ discriminative words and nondiscriminative
words. Only discriminative words are useful for grouping documents. The involvement of
nondiscriminative words confuses the clustering process and leads to poor clustering solution in return.
A variation inference algorithm is used to infer the document collection structure and partition of
document words at the same time. Dirichlet Process Mixture (DPM) model is used to partition
documents. DPM clustering model uses both the data likelihood and the clustering property of the
Dirichlet Process (DP). Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without
requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept
relationships are analyzed with Ontology support. Semantic weight model is used for the document
similarity analysis. The system improves the scalability with the support of labels and concept relations
for dimensionality reduction process.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t- SNE to develop a a context-aware clustering approach (using pre-trained
word embeddings). We encode words or categorical data into numerical, context-aware, vectors that we use to cluster the data points using common clustering algorithms like K-means.
A Text Mining Research Based on LDA Topic Modellingcsandit
A Large number of digital text information is gener
ated every day. Effectively searching,
managing and exploring the text data has become a m
ain task. In this paper, we first represent
an introduction to text mining and a probabilistic
topic model Latent Dirichlet allocation. Then
two experiments are proposed - Wikipedia articles a
nd users’ tweets topic modelling. The
former one builds up a document topic model, aiming
to a topic perspective solution on
searching, exploring and recommending articles. The
latter one sets up a user topic model,
providing a full research and analysis over Twitter
users’ interest. The experiment process
including data collecting, data pre-processing and
model training is fully documented and
commented. Further more, the conclusion and applica
tion of this paper could be a useful
computation tool for social and business research.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
This paper explores the use of machine learning approaches, or more specifically, four supervised learning
Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Naïve Bays (NB), and Support Vector
Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of
documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce
satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy
document feature vectors.
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONSIJDKP
Big Data creates many challenges for data mining experts, in particular in getting meanings of text data. It is beneficial for text mining to build a bridge between word embedding process and graph capacity to connect the dots and represent complex correlations between entities. In this study we examine processes of building a semantic graph model to determine word associations and discover document topics. We introduce a novel Word2Vec2Graph model that is built on top of Word2Vec word embedding model. We demonstrate how this model can be used to analyze long documents, get unexpected word associations and uncover document topics. To validate topic discovery method we transfer words to vectors and vectors to images and use CNN deep learning image classification.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
This article will introduce some approaches for improving text categorization models by integrating
previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very
similar in content and related to telecommunications, Internet and computer areas were selected for models
experiments. Several domain ontologies, covering these areas were built and integrated to categorization
models for their improvements.
In recent years, Word2Vec and its expansion (Doc2Vec, Paragraph2Vec, etc.) is receiving a lot of attention in the NLP field.
In this slide, we will introduce our approach for applying the Doc2Vec to the item recommender system. And we report the results of the performance evaluation of Doc2Vec-based recommender by using Rakuten Singapore EC data.
Textual Data Partitioning with Relationship and Discriminative AnalysisEditor IJMTER
Data partitioning methods are used to partition the data values with similarity. Similarity
measures are used to estimate transaction relationships. Hierarchical clustering model produces tree
structured results. Partitioned clustering produces results in grid format. Text documents are
unstructured data values with high dimensional attributes. Document clustering group ups unlabeled text
documents into meaningful clusters. Traditional clustering methods require cluster count (K) for the
document grouping process. Clustering accuracy degrades drastically with reference to the unsuitable
cluster count.
Textual data elements are divided into two types’ discriminative words and nondiscriminative
words. Only discriminative words are useful for grouping documents. The involvement of
nondiscriminative words confuses the clustering process and leads to poor clustering solution in return.
A variation inference algorithm is used to infer the document collection structure and partition of
document words at the same time. Dirichlet Process Mixture (DPM) model is used to partition
documents. DPM clustering model uses both the data likelihood and the clustering property of the
Dirichlet Process (DP). Dirichlet Process Mixture Model for Feature Partition (DPMFP) is used to
discover the latent cluster structure based on the DPM model. DPMFP clustering is performed without
requiring the number of clusters as input.
Document labels are used to estimate the discriminative word identification process. Concept
relationships are analyzed with Ontology support. Semantic weight model is used for the document
similarity analysis. The system improves the scalability with the support of labels and concept relations
for dimensionality reduction process.
CONTEXT-AWARE CLUSTERING USING GLOVE AND K-MEANSijseajournal
ABSTRACT
In this paper we propose a novel method to cluster categorical data while retaining their context. Typically, clustering is performed on numerical data. However it is often useful to cluster categorical data as well, especially when dealing with data in real-world contexts. Several methods exist which can cluster categorical data, but our approach is unique in that we use recent text-processing and machine learning advancements like GloVe and t- SNE to develop a a context-aware clustering approach (using pre-trained
word embeddings). We encode words or categorical data into numerical, context-aware, vectors that we use to cluster the data points using common clustering algorithms like K-means.
A Text Mining Research Based on LDA Topic Modellingcsandit
A Large number of digital text information is gener
ated every day. Effectively searching,
managing and exploring the text data has become a m
ain task. In this paper, we first represent
an introduction to text mining and a probabilistic
topic model Latent Dirichlet allocation. Then
two experiments are proposed - Wikipedia articles a
nd users’ tweets topic modelling. The
former one builds up a document topic model, aiming
to a topic perspective solution on
searching, exploring and recommending articles. The
latter one sets up a user topic model,
providing a full research and analysis over Twitter
users’ interest. The experiment process
including data collecting, data pre-processing and
model training is fully documented and
commented. Further more, the conclusion and applica
tion of this paper could be a useful
computation tool for social and business research.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
SUPERVISED LEARNING METHODS FOR BANGLA WEB DOCUMENT CATEGORIZATIONijaia
This paper explores the use of machine learning approaches, or more specifically, four supervised learning
Methods, namely Decision Tree(C 4.5), K-Nearest Neighbour (KNN), Naïve Bays (NB), and Support Vector
Machine (SVM) for categorization of Bangla web documents. This is a task of automatically sorting a set of
documents into categories from a predefined set. Whereas a wide range of methods have been applied to English text categorization, relatively few studies have been conducted on Bangla language text categorization. Hence, we attempt to analyze the efficiency of those four methods for categorization of Bangla documents. In order to validate, Bangla corpus from various websites has been developed and used as examples for the experiment. For Bangla, empirical results support that all four methods produce
satisfactory performance with SVM attaining good result in terms of high dimensional and relatively noisy
document feature vectors.
SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONSIJDKP
Big Data creates many challenges for data mining experts, in particular in getting meanings of text data. It is beneficial for text mining to build a bridge between word embedding process and graph capacity to connect the dots and represent complex correlations between entities. In this study we examine processes of building a semantic graph model to determine word associations and discover document topics. We introduce a novel Word2Vec2Graph model that is built on top of Word2Vec word embedding model. We demonstrate how this model can be used to analyze long documents, get unexpected word associations and uncover document topics. To validate topic discovery method we transfer words to vectors and vectors to images and use CNN deep learning image classification.
ONTOLOGY INTEGRATION APPROACHES AND ITS IMPACT ON TEXT CATEGORIZATIONIJDKP
This article will introduce some approaches for improving text categorization models by integrating
previously imported ontologies. From the Reuters Corpus Volume I (RCV1) dataset, some categories very
similar in content and related to telecommunications, Internet and computer areas were selected for models
experiments. Several domain ontologies, covering these areas were built and integrated to categorization
models for their improvements.
In recent years, Word2Vec and its expansion (Doc2Vec, Paragraph2Vec, etc.) is receiving a lot of attention in the NLP field.
In this slide, we will introduce our approach for applying the Doc2Vec to the item recommender system. And we report the results of the performance evaluation of Doc2Vec-based recommender by using Rakuten Singapore EC data.
Learning to Rank for Recommender Systems - ACM RecSys 2013 tutorialAlexandros Karatzoglou
The slides from the Learning to Rank for Recommender Systems tutorial given at ACM RecSys 2013 in Hong Kong by Alexandros Karatzoglou, Linas Baltrunas and Yue Shi.
EXPERT OPINION AND COHERENCE BASED TOPIC MODELINGijnlc
In this paper, we propose a novel algorithm that rearrange the topic assignment results obtained from topic
modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how much
the results conform to expert opinion, which is a data structure called TDAG that we defined to represent the
probability that a pair of highly correlated words appear together. In order to make sure that the internal
structure does not get changed too much from the rearrangement, coherence, which is a well known metric
for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure.
We developed two ways to systematically obtain the expert opinion from data, depending on whether the
data has relevant expert writing or not. The final algorithm which takes into account both coherence and
expert opinion is presented. Finally we compare amount of adjustments needed to be done for each topic
modeling method, NMF and LDA.
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
A Large number of digital text information is generated every day. Effectively searching,
managing and exploring the text data has become a main task. In this paper, we first represent
an introduction to text mining and a probabilistic topic model Latent Dirichlet allocation. Then
two experiments are proposed - Wikipedia articles and users’ tweets topic modelling. The
former one builds up a document topic model, aiming to a topic perspective solution on
searching, exploring and recommending articles. The latter one sets up a user topic model,
providing a full research and analysis over Twitter users’ interest. The experiment process
including data collecting, data pre-processing and model training is fully documented and
commented. Further more, the conclusion and application of this paper could be a useful
computation tool for social and business research.
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
A Large number of digital text information is generated every day. Effectively searching, managing and
exploring the text data has become a main task. In this paper, we first present an introduction to text
mining and LDA topic model. Then we deeply explained how to apply LDA topic model to text corpus by
doing experiments on Simple Wikipedia documents. The experiments include all necessary steps of data
retrieving, pre-processing, fitting the model and an application of document exploring system. The result of
the experiments shows LDA topic model working effectively on documents clustering and finding the
similar documents. Furthermore, the document exploring system could be a useful research tool for
students and researchers.
Classification of Health Forum Messages using Deep LearningSejal Naidu
In this project, we study the problem of classifying questions posted on healthcare forums. Information seekers post their questions on forums, and other members of communities provide their suggestions or answers. Tagging the posts with relevant tasks helps draw attention from the members with relevant expertise. Usually the tagging is done by the original poster or by other members of the forum. Automatic tagging on the question based on the contents of the post is a very important step to ensure relevant and timely responses.
A Novel Multi- Viewpoint based Similarity Measure for Document ClusteringIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
Object-Oriented Database Model For Effective Mining Of Advanced Engineering M...cscpconf
Materials have become a very important aspect of our daily life and the search for better and
new kind of engineered materials has created some opportunities for the Information science
and technology fraternity to investigate in to the world of materials. Hence this combination of
materials science and Information science together is nowadays known as Materials
Informatics. An Object-Oriented Database Model has been proposed for organizing advanced engineering materials datasets.
TOPIC EXTRACTION OF CRAWLED DOCUMENTS COLLECTION USING CORRELATED TOPIC MODEL...ijnlc
The tremendous increase in the amount of available research documents impels researchers to propose topic models to extract the latent semantic themes of a documents collection. However, how to extract the hidden topics of the documents collection has become a crucial task for many topic model applications. Moreover, conventional topic modeling approaches suffer from the scalability problem when the size of documents collection increases. In this paper, the Correlated Topic Model with variational ExpectationMaximization algorithm is implemented in MapReduce framework to solve the scalability problem. The proposed approach utilizes the dataset crawled from the public digital library. In addition, the full-texts of the crawled documents are analysed to enhance the accuracy of MapReduce CTM. The experiments are conducted to demonstrate the performance of the proposed algorithm. From the evaluation, the proposed approach has a comparable performance in terms of topic coherences with LDA implemented in MapReduce framework.
RSDC (Reliable Scheduling Distributed in Cloud Computing)IJCSEA Journal
In this paper we will present a reliable scheduling algorithm in cloud computing environment. In this algorithm we create a new algorithm by means of a new technique and with classification and considering request and acknowledge time of jobs in a qualification function. By evaluating the previous algorithms, we understand that the scheduling jobs have been performed by parameters that are associated with a failure rate. Therefore in the roposed algorithm, in addition to previous parameters, some other important parameters are used so we can gain the jobs with different scheduling based on these parameters. This work is associated with a mechanism. The major job is divided to sub jobs. In order to balance the jobs we should calculate the request and acknowledge time separately. Then we create the scheduling of each job by calculating the request and acknowledge time in the form of a shared job. Finally efficiency of the system is increased. So the real time of this algorithm will be improved in comparison with the other algorithms. Finally by the mechanism presented, the total time of processing in cloud computing is improved in comparison with the other algorithms.
A Benchmark for the Use of Topic Models for Text Visualization Tasks - Online...Matthias Trapp
Presentation of research paper "A Benchmark for the Use of Topic Models for Text Visualization Tasks" at the 15th International Symposium on Visual Information Communication and Interaction in Chur, Switzerland.
Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges, which should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. I will present some first steps towards addressing these problems and outline remaining challenges.
How to Make a Field invisible in Odoo 17Celine George
It is possible to hide or invisible some fields in odoo. Commonly using “invisible” attribute in the field definition to invisible the fields. This slide will show how to make a field invisible in odoo 17.
Instructions for Submissions thorugh G- Classroom.pptxJheel Barad
This presentation provides a briefing on how to upload submissions and documents in Google Classroom. It was prepared as part of an orientation for new Sainik School in-service teacher trainees. As a training officer, my goal is to ensure that you are comfortable and proficient with this essential tool for managing assignments and fostering student engagement.
The French Revolution, which began in 1789, was a period of radical social and political upheaval in France. It marked the decline of absolute monarchies, the rise of secular and democratic republics, and the eventual rise of Napoleon Bonaparte. This revolutionary period is crucial in understanding the transition from feudalism to modernity in Europe.
For more information, visit-www.vavaclasses.com
Students, digital devices and success - Andreas Schleicher - 27 May 2024..pptxEduSkills OECD
Andreas Schleicher presents at the OECD webinar ‘Digital devices in schools: detrimental distraction or secret to success?’ on 27 May 2024. The presentation was based on findings from PISA 2022 results and the webinar helped launch the PISA in Focus ‘Managing screen time: How to protect and equip students against distraction’ https://www.oecd-ilibrary.org/education/managing-screen-time_7c225af4-en and the OECD Education Policy Perspective ‘Students, digital devices and success’ can be found here - https://oe.cd/il/5yV
The Art Pastor's Guide to Sabbath | Steve ThomasonSteve Thomason
What is the purpose of the Sabbath Law in the Torah. It is interesting to compare how the context of the law shifts from Exodus to Deuteronomy. Who gets to rest, and why?
How to Split Bills in the Odoo 17 POS ModuleCeline George
Bills have a main role in point of sale procedure. It will help to track sales, handling payments and giving receipts to customers. Bill splitting also has an important role in POS. For example, If some friends come together for dinner and if they want to divide the bill then it is possible by POS bill splitting. This slide will show how to split bills in odoo 17 POS.
Ethnobotany and Ethnopharmacology:
Ethnobotany in herbal drug evaluation,
Impact of Ethnobotany in traditional medicine,
New development in herbals,
Bio-prospecting tools for drug discovery,
Role of Ethnopharmacology in drug evaluation,
Reverse Pharmacology.
Welcome to TechSoup New Member Orientation and Q&A (May 2024).pdfTechSoup
In this webinar you will learn how your organization can access TechSoup's wide variety of product discount and donation programs. From hardware to software, we'll give you a tour of the tools available to help your nonprofit with productivity, collaboration, financial management, donor tracking, security, and more.
This is a presentation by Dada Robert in a Your Skill Boost masterclass organised by the Excellence Foundation for South Sudan (EFSS) on Saturday, the 25th and Sunday, the 26th of May 2024.
He discussed the concept of quality improvement, emphasizing its applicability to various aspects of life, including personal, project, and program improvements. He defined quality as doing the right thing at the right time in the right way to achieve the best possible results and discussed the concept of the "gap" between what we know and what we do, and how this gap represents the areas we need to improve. He explained the scientific approach to quality improvement, which involves systematic performance analysis, testing and learning, and implementing change ideas. He also highlighted the importance of client focus and a team approach to quality improvement.
Thesis Statement for students diagnonsed withADHD.ppt
Semantic Annotation of Documents
1. Project Guide:
Dr. Vasudev Verma
Priya Radhakrishnan
Participants:
Rashi Shrishrimal 201301203
Naman Singhal 201330072
K S Chandra Reddy 201505544
Semantic Annotation of
Documents
2. Problem Statement
Annotation of title and abstract of research paper to 2012 ACM classification
categories.
The 2012 ACM Computing Classification System has been developed as a poly-
hierarchical ontology that can be utilized in semantic web applications. It plays
a key role in the development of a people search interface in the ACM Digital
Library to supplement its current traditional bibliographic search.
The complete classification tree can be found here.
4. Dataset Overview
1. A private dataset of the SIEL lab of IIIT-Hyderabad. Fields:
a. Title
b. Author
c. Country of authors
d. Year of publication
e. Conference
f. Categories
g. Abstract (For few research papers)
2. Wiki Dataset : For all the acm categories a dataset was build having the summary of the wikipedia page (The First
section of the wikipedia document).
3. Dblp dataset : For initial tasks, this dataset was used to train the lda model. Though this was not used for the final
proposition of the model.
5. Technical Approach and Models
Multiple approaches were tried for the given problem
1) Cosine similarity
2) Latent Dirichlet Allocation
3) Labeled LDA + Doc2vec
6. Approach 1 : Cosine
Similarity
Intuition Behind The Approach:
Title and abstract paper generally contain
words contained in the description of the
categories so finding the cosine similarity of the
tf-idf vector of the title + abstract and all the
categories and assigning the closest one became
our first approach.
7. Approach 2 : Latent Dirichlet
Allocation
Intuition Behind The Approach:
Topic modelling - a topic model is a type
of statistical model for discovering the
abstract "topics" that occur in a collection
of documents.
We picked the most common topic
modelling algorithm - Latent Dirichlet
Allocation
8. Proposed Model : Labeled LDA + Doc2Vec
Intuition :
Accuracy can only be improved by supervised topic modelling i.e. by labelling the
documents with topics and taking care of the semantic distance between the
research paper and the categories.
References :
Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora -
Stanford University (2009)
Distributed Representations of Sentences and Document - Google Inc. (2014)
9. Labeled LDA is a probabilistic graphical model that describes a process for
generating a labeled document collection. Like Latent Dirichlet Allocation, Labeled
LDA models each document as a mixture of underlying topics and generates each
word from one topic. Unlike LDA, L-LDA incorporates supervision by simply
constraining the topic model to use only those topics that correspond to a
document’s (observed) label set.
Labeled Latent Dirichlet Allocation
10. Doc2Vec
Doc2Vec is a tool provided by gensim that maps each sentences to a paragraph
vector. Paragraph vector is an unsupervised framework that learns continuous
distributed vector representations for pieces of texts. The texts can be of variable-
length, ranging from sentences to documents. The name Paragraph Vector is to
emphasize the fact that the method can be applied to variable-length pieces of
texts, anything from a phrase or sentence to a large document.
13. Mean Average Precision
Mean average precision for a set of queries is the mean of the average precision scores for each query.
where Q is the number of queries.
In our work, the number of queries is the number of test documents and average precision is calculated by
the definition: Average of the precision values at the points at which each relevant document is retrieved.
For only Doc2Vec model, MAP = (0.2518) 25.18%
For the final model (approach 3), we are getting a MAP of 0.5931 (59.31%).
We believe for our work and the data provided, this MAP is quite good. The results of such approaches
depend a lot on the data.
- For performance measure
14. NDCG (Normalized Discounted Cumulative
Gain)
For only Doc2Vec NDCG = 35.26%
For the final approach, Doc2Vec + labeled LDA, NDCG = 45.03%
15. Results Overview
Categories used for classification :
Artificial - Intelligence, Databases, Computer-Vision,
Information Retrieval and Other.
Training Dataset :
No. of research papers : 549
Test Dataset:
No. of research papers : 51745
No. of research papers correctly classified : 36334
MAP score : 59.31 %
NDCG score : 45.03%