This document discusses improving the categorization of scientific articles for an expert search system. It proposes training a category model using labeled training texts from a related domain, rather than requiring labeled scientific articles. Features are extracted from the training texts using TF-IDF and n-grams. The category model is tested on scientific articles by calculating cosine similarity between article and category feature vectors. An evaluation compares automated versus manual training text selection across common categories and common/specific categories, finding that manual selection achieves higher accuracy averages across five expert evaluators. The approach shows potential but challenges include selecting representative training texts and ensuring category coverage of the domain.
Category & Training Texts Selection for Scientific Article Categorization in an Expert Search System
1. Category & Training Texts Selection for
Scientific Article Categorization in
an Expert Search System
By
Gan Keng Hoon*, Chua San Thai,
Khoh Zhuo Yan, Goh Kau Yang
School of Computer Sciences,
Universiti Sains Malaysia
2. Motivation
Scientific articles are produced as results of research.
Organizing scientific articles into subject areas or topics
helps in discovery, navigation, etc.
6. Scope
Application oriented research
Expert Search System
DBLP Dataset
School of Computer Sciences, USM
Goal
Improving the categorization of scientific articles
For
Capturing experts’ expertise based on their publications.
Enable category filtering during search.
7. Existing Approaches
Labelled Scientific Article
Supervised Learning method to train and test
Feature Selection
Bag of Words, N-gram, POS, Term Frequency, TF-IDF
This research
Train with Labelled Scientific Related Domain Texts
Test with Scientific Article
8. Research Justification
Avoid the use of a large number of labelled training texts.
Focus on differentiating good sources of training texts.
Use a reasonably small number of training texts to build the
subject category model.
10. Feature Selection
Feature Term Generation
N-gram technique is used to generate potential term candidates from the training text. E.g.
D = “Search engine is an artificial intelligence system.”
2-gram words: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] =>
artificial intelligence [5] => intelligence system)
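The 2-gram generation above can be sketched in a few lines of Python; this is a minimal illustration, not the authors' implementation, and the function name `word_ngrams` is a hypothetical helper:

```python
def word_ngrams(text, n=2):
    """Generate word-level n-grams (term candidates) from a sentence."""
    words = text.rstrip(".").split()  # drop trailing period, split on whitespace
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

doc = "Search engine is an artificial intelligence system."
print(word_ngrams(doc, n=2))
# → ['Search engine', 'engine is', 'is an', 'an artificial',
#    'artificial intelligence', 'intelligence system']
```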
Features Selection by TF-IDF
Term Frequency–Inverse Document Frequency (TF-IDF) is a common method for keyword
weighting: the TF-IDF value of each term is computed, and the terms with the top N values are
selected as features. This method penalizes a term when it occurs across many different training
texts. The TF-IDF values are computed as

TF-IDF(term_i) = TF(term_i) × log(N_D / DF(term_i))

where TF(term_i) is the frequency of term_i, DF(term_i) is the number of documents containing
term_i, and N_D is the total number of documents.
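The feature-selection step can be sketched as follows; this is a minimal illustration of the formula above under the assumption that each training text is already tokenized into terms (e.g. by the n-gram step), not the authors' code:

```python
import math

def tfidf_features(corpus, top_n=5):
    """Score each term of each training text by TF * log(N_D / DF),
    then keep the top-N terms as that text's features."""
    n_docs = len(corpus)
    # Document frequency: number of training texts containing each term.
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    features = []
    for doc in corpus:
        # TF is the raw count of the term in this document.
        scores = {t: doc.count(t) * math.log(n_docs / df[t]) for t in set(doc)}
        features.append(sorted(scores, key=scores.get, reverse=True)[:top_n])
    return features
```

Note that a term occurring in every training text gets log(N_D / N_D) = 0, i.e. it is fully penalized, which is the behaviour the slide describes.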
11. Transfer Training Approach
Intuition
If the training texts are representative enough to cover the concept of a
category, then the training sets can be obtained from any source that shares
similar concepts or semantics.
Criteria
The two text sources share the same, or partially similar, categories.
The categories must bear the same concept or meaning.
The training source must be comprehensive enough to cover a category’s concept.
The training source must be available even when the testing source is not.
This approach is particularly useful when the resources of unseen texts are not
readily available.
12. Training and Testing Category Model
The training of the category model, CM, can be defined using the CM_Build function. For each category,
Cat, the function takes in a set of documents, D_Cat, i.e. the training texts, and maps them to a set of
features, F_Cat.

CM_Build: D_Cat → F_Cat

The testing of the category model is defined using the CM_Sim function. For each new document, D_new,
the function maps the document to the set of most relevant categories, Cat.

CM_Sim: D_new → Cat

Feature Similarity Scoring
The scoring technique is based on the Vector Space Model cosine similarity measure. The feature
sets of the category model are viewed as vectors in a vector space, where each term has its own
axis. The similarity of a category and a document, Sim_F, is calculated by comparing the deviation
angle between the vectors as follows:

Sim_F = (F_Cat · F_D_new) / (|F_Cat| × |F_D_new|)

where F_Cat is the feature vector of a category and F_D_new is the feature vector of a new document.
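The cosine scoring above can be sketched directly from the formula; this is an illustrative sketch (feature vectors represented as term → weight dictionaries, function names are hypothetical), not the system's actual code:

```python
import math

def cosine_sim(f_cat, f_doc):
    """Sim_F = (F_Cat . F_Dnew) / (|F_Cat| * |F_Dnew|)."""
    dot = sum(w * f_doc.get(term, 0.0) for term, w in f_cat.items())
    norm_cat = math.sqrt(sum(w * w for w in f_cat.values()))
    norm_doc = math.sqrt(sum(w * w for w in f_doc.values()))
    if norm_cat == 0.0 or norm_doc == 0.0:
        return 0.0  # no features on one side => no similarity
    return dot / (norm_cat * norm_doc)

def best_category(category_models, f_doc):
    """CM_Sim: assign a new article to its highest-scoring category."""
    return max(category_models, key=lambda c: cosine_sim(category_models[c], f_doc))
```

A quick usage check: an article sharing the term "search" with an "information retrieval" category scores 0.5 against it and 0.0 against an unrelated category, so it is assigned to the former.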
13. Evaluation Settings
Performance Metric
Whether a scientific article is correctly assigned to a category.
Expert judgement is used to evaluate.
Training Texts
Title and Abstract are used.
Tasks
Common categories (30 general) vs. Common + Specific categories (30
general + 12 domain-specific)
Automated vs. Manual selection of training texts
14. Evaluation Results
            Common categories      Common + specific       Common + specific
            + Automated            categories + Automated  categories + Manual
            training texts (%)     training texts (%)      training texts (%)
Expert 1    62.50                  68.75                   81.25
Expert 2    46.67                  46.67                   53.33
Expert 3    33.33                  33.33                   66.67
Expert 4    33.33                  41.67                   41.67
Expert 5    43.75                  37.50                   28.13
Average     43.92                  45.59                   54.21
15. Conclusion
Possibility
A category model can be trained using training texts from one source and
applied to a different source.
Challenge
Selection of training texts, as they influence the accuracy of the trained
model.
Limitation
Selection of categories: the selected set is too small to cover the
domain’s (e.g. Computer Science) research areas.
16. Thank You
For more of our work, please visit ir.cs.usm.my
Email me at khgan@usm.my