These are the presentation slides for the workshop BigScholar 2019, held in conjunction with CIKM 2019 (ACM International Conference on Information and Knowledge Management), Nov 7, 2019, at CNCC, Beijing, China.
Citation: Kurakawa K, Sun Y and Ando S (2020) Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence. Front. Big Data 2:48. doi: 10.3389/fdata.2019.00048
Spectral Clustering and Vantage Point Indexing for Efficient Data Retrieval - IJECEIAES
Data mining is an essential process for identifying patterns in large datasets through machine learning techniques and database systems. Clustering of high-dimensional data is a very challenging process due to the curse of dimensionality, and existing methods leave space complexity and data retrieval performance unimproved. To overcome these limitations, a Spectral Clustering Based VP Tree Indexing Technique is introduced. The technique clusters and indexes densely populated high-dimensional data points for effective data retrieval based on user queries. A Normalized Spectral Clustering Algorithm is used to group similar high-dimensional data points. After that, a Vantage Point Tree is constructed to index the clustered data points with minimum space complexity. Finally, the indexed data is retrieved in response to user queries using a Vantage Point Tree based Data Retrieval Algorithm, which helps to improve the true positive rate with minimum retrieval time. Performance is measured in terms of space complexity, true positive rate, and data retrieval time on the El Nino weather datasets from the UCI Machine Learning Repository. Experimental results show that the proposed technique reduces space complexity by 33% and data retrieval time by 24% compared to state-of-the-art works.
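The abstract above comes without code; as a rough, hypothetical sketch of the vantage-point-tree indexing it builds on (the function names and the recursive dict representation are my own, not the authors'):

```python
def build_vp_tree(points, dist):
    """Recursively build a vantage-point tree from a list of points.

    Each node stores a vantage point, the median distance mu to the
    remaining points, and "inside" (d <= mu) / "outside" subtrees.
    """
    if not points:
        return None
    vp, rest = points[0], points[1:]
    if not rest:
        return {"vp": vp, "mu": 0.0, "inside": None, "outside": None}
    dists = sorted(dist(vp, p) for p in rest)
    mu = dists[len(dists) // 2]
    return {"vp": vp, "mu": mu,
            "inside": build_vp_tree([p for p in rest if dist(vp, p) <= mu], dist),
            "outside": build_vp_tree([p for p in rest if dist(vp, p) > mu], dist)}

def nearest(node, q, dist, best=None):
    """Return the stored point closest to query q, pruning subtrees
    that cannot contain a better candidate (triangle inequality)."""
    if node is None:
        return best
    d = dist(q, node["vp"])
    if best is None or d < dist(q, best):
        best = node["vp"]
    tau = dist(q, best)  # current search radius
    # Descend the likelier side first; visit the other side only if
    # it could still hold a point within radius tau of the query.
    if d <= node["mu"]:
        best = nearest(node["inside"], q, dist, best)
        if d + tau > node["mu"]:
            best = nearest(node["outside"], q, dist, best)
    else:
        best = nearest(node["outside"], q, dist, best)
        if d - tau <= node["mu"]:
            best = nearest(node["inside"], q, dist, best)
    return best
```

A query descends the likelier half first and prunes with the triangle inequality, which is what keeps retrieval time low once similar points have been clustered together.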
K-Means, its Variants and its Applications - Varad Meru
This presentation was given by our project group at the Lead College competition at Shivaji University, where it won first prize. We focused mainly on Rough K-Means and built a Social-Network Recommender System based on Rough K-Means.
The members of the project group were:
Mansi Kulkarni,
Nikhil Ingole,
Prasad Mohite,
Varad Meru, and
Vishal Bhavsar.
Wonderful Experience !!!
System for Prediction of Non Stationary Time Series based on the Wavelet Radi... - IJECEIAES
This paper proposes and examines the performance of a hybrid model called the wavelet radial basis function neural network (WRBFNN). Its performance is compared with the wavelet feedforward neural network (WFFNN) model by developing a prediction or forecasting system that considers two input formats, input9 and input17, and four types of non-stationary time series data. The MODWT transform is used to generate wavelet and smooth coefficients, several elements of which are chosen in a particular way to serve as inputs to both the RBFNN and FFNN models. The performance of the WRBFNN and WFFNN models is evaluated using MAPE and MSE indicators, while their computation processes are compared using two indicators: number of epochs and length of training. On stationary benchmark data, all models perform with very high accuracy. The WRBFNN9 model is the most accurate on non-stationary data containing linear trend elements, while the WFFNN17 model performs best on non-stationary data with non-linear trend and seasonal elements. In terms of computational speed, the WRBFNN model is superior, with far fewer epochs and much shorter training time.
Fuzzy clustering and fuzzy c-means partition cluster analysis and validation ... - IJECEIAES
A hard partition clustering algorithm assigns equally distant points to exactly one cluster, even though such points could plausibly be assigned to several clusters simultaneously. Fuzzy cluster analysis instead assigns membership coefficients to data points that are equidistant between two clusters, so a data point can belong to more than one cluster at the same time. For a subset of the CiteScore dataset, the fuzzy clustering (fanny) and fuzzy c-means (fcm) algorithms were implemented to study data points that lie equally distant from each other. Before analysis, the clusterability of the dataset was evaluated with the Hopkins statistic, which resulted in 0.4371; a value below 0.5 was taken to indicate that the data is highly clusterable. The optimal number of clusters was determined using the NbClust package, where nine of the indices proposed a 3-cluster solution as the best. Further, an appropriate value of the fuzziness parameter m was evaluated to determine the distribution of membership values as m varies from 1 to 2. The coefficient of variation (CV), also known as relative variability, was evaluated to study the spread of the data. The time complexity of the fanny and fcm algorithms was evaluated by keeping the number of data points constant and varying the number of clusters.
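For readers unfamiliar with how fuzzy c-means produces the graded memberships discussed above, here is a minimal sketch of the standard membership update (a textbook formula, not the paper's exact implementation; the names are illustrative):

```python
def fcm_memberships(points, centers, m=2.0):
    """Fuzzy c-means membership matrix: u[i][j] is the degree to which
    point i belongs to cluster j, softened by the fuzzifier m > 1."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    exp = 2.0 / (m - 1.0)
    u = []
    for p in points:
        dists = [d(p, c) for c in centers]
        if 0.0 in dists:  # point coincides with a center: crisp membership
            j = dists.index(0.0)
            u.append([1.0 if k == j else 0.0 for k in range(len(centers))])
            continue
        # u_ij = 1 / sum_k (d_ij / d_kj) ** (2 / (m - 1))
        u.append([1.0 / sum((dj / dk) ** exp for dk in dists) for dj in dists])
    return u
```

A point exactly midway between two centers gets membership 0.5 in each, which is precisely the equidistant case the abstract studies; as m approaches 1 the memberships harden toward a crisp partition.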
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An... - PyData
Artificial intelligence is emerging as a new paradigm in materials science. This talk describes how physical intuition and (insightful) machine learning can solve the complicated task of structure recognition in materials at the nanoscale.
Due to the continuous growth of Internet technology, security mechanisms need to be established. The Intrusion Detection System (IDS) is increasingly becoming a crucial component of computer and network security systems. Most existing intrusion detection techniques emphasize building an intrusion detection model based on all the features provided, some of which are irrelevant or redundant. This paper proposes identifying the important input features for building an IDS that is computationally efficient and effective. We identify important attributes for each attack type by analyzing the detection rate, and input the attack-specific attributes to Naive Bayes and Random Forest classifiers. We perform our experiments on the NSL-KDD intrusion detection dataset, which consists of selected records from the complete KDD Cup 1999 intrusion detection dataset.
These are the presentation slides for the joint conference of the 134th SIG conference on Information Fundamentals and Access Technologies (IFAT) and the 112th SIG conference on Document Communication (DC), Information Processing Society of Japan (IPSJ), March 22, 2019, at Toyo University, Hakusan Campus.
Cite: Kei Kurakawa, Yuan Sun, and Satoko Ando, Applying a new subject classification scheme for a database by a data-driven correspondence, IPSJ SIG Technical Report, Vol.2019-IFAT-134/2019-DC-112, No.7, pp.1-10, (2019).
MediaEval 2015 - UNED-UV @ Retrieving Diverse Social Images Task - Poster - multimediaeval
This paper details the participation of the UNED-UV group at the 2015 Retrieving Diverse Social Images Task. This year, our proposal is based on a multi-modal approach that firstly applies a textual algorithm based on Formal Concept Analysis (FCA) and Hierarchical Agglomerative Clustering (HAC) to detect the latent topics addressed by the images to diversify them according to these topics. Secondly, a Local Logistic Regression model, which uses the low level features and some relevant and non-relevant samples, is adjusted and estimates the relevance probability for all the images in the database.
http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly... - Angelo Salatino
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
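As a toy illustration of the syntactic matching step such a classifier performs (the real CSO Classifier also uses word embeddings and the ontology's hierarchy; this simplified matcher is my own sketch, not the published tool):

```python
def extract_topics(text, ontology_labels, max_n=3):
    """Return ontology labels that appear verbatim as n-grams
    (up to max_n words) in a paper's title/abstract text."""
    words = [w.strip(".,;:()").lower() for w in text.split()]
    labels = {label.lower() for label in ontology_labels}
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram in labels:
                found.add(gram)
    return sorted(found)
```

Given the metadata text and a list of concept labels, the matcher returns every label mentioned in the text; the published system then enriches these matches with semantically similar concepts and their ancestors in CSO.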
As we may link: a model to support aggregated scientific knowledge - Prashant Gupta
Today, researchers are bogged down by a continually growing amount of complex and diverse scientific knowledge, fragmented and dispersed among various disciplines, communities, and information resources. Contemporary digital tools are efficient in dealing with the complexity and diversity of scientific knowledge and the process of science, but they have compartmentalized scientific knowledge among disparate and disconnected systems. For example, databases are used to structure data to facilitate easy retrieval; workflows represent the process of experiments; analytical tools support data analysis; and visualization tools visualize data and results for better understanding. However, these tools rarely connect or join together to synthesize an integrated view. Our digital knowledge ecosystem is siloed and poses a challenge for researchers seeking to search, comprehend, and reproduce scientific experiments.
Vannevar Bush, in his article 'As We May Think', discussed the data and information deluge and the challenge brought by the fragmentary nature of scientific knowledge. He proposed an imaginary machine, the Memex, that could tie knowledge records into a mesh of associative trails, which could be reviewed and consulted as a form of graph search. This talk will discuss a model that adopts Bush's associationist view to integrate scientific knowledge. Categories are commonly used in databases (in the form of logical schemas) and ontologies (as concepts and properties), but often these artifacts are disconnected from each other. The proposed model connects categories, along with their process of construction and evolution, with a database and ontology via tools that support their evolution. Explicitly connecting these knowledge artifacts (via their digital tools) not only provides an integrated view, but can also be capitalized on to support mediation among these artifacts and to keep them consistent with new conceptualizations. Such mediation among scientific artifacts will reconnect computationally enabled science with the knowledge underpinning it.
In the last decade, several Scientific Knowledge Graphs (SKGs) have been released, representing scientific knowledge in a structured, interlinked, and semantically rich manner. But what kind of information do they describe? How have they been built? What can we do with them? In this lecture, I will first provide an overview of well-known SKGs, like Microsoft Academic Graph, Dimensions, and others. Then, I will present the Academia/Industry DynAmics (AIDA) Knowledge Graph, which describes 21M publications and 8M patents according to i) the research topics drawn from the Computer Science Ontology, ii) the type of the authors' affiliations (e.g., academia, industry), and iii) 66 industrial sectors (e.g., automotive, financial, energy, electronics) from the Industrial Sectors Ontology (INDUSO). Finally, I will showcase a number of tools and approaches using such SKGs, supporting researchers, companies, and policymakers in making sense of research dynamics.
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas - Angelo Salatino
Ontologies of research areas are important tools for characterising, exploring, and analysing the research landscape. Some fields of research are comprehensively described by large-scale taxonomies, e.g., MeSH in Biology and PhySH in Physics. Conversely, current Computer Science taxonomies are coarse-grained and tend to evolve slowly. For instance, the ACM classification scheme contains only about 2K research topics and the last version dates back to 2012. In this paper, we introduce the Computer Science Ontology (CSO), a large-scale, automatically generated ontology of research areas, which includes about 15K topics and 70K semantic relationships. It was created by applying the Klink-2 algorithm on a very large dataset of 16M scientific articles. CSO presents two main advantages over the alternatives: i) it includes a very large number of topics that do not appear in other classifications, and ii) it can be updated automatically by running Klink-2 on recent corpora of publications. CSO powers several tools adopted by the editorial team at Springer Nature and has been used to enable a variety of solutions, such as classifying research publications, detecting research communities, and predicting research trends. To facilitate the uptake of CSO we have developed the CSO Portal, a web application that enables users to download, explore, and provide granular feedback on CSO at different levels. Users can use the portal to rate topics and relationships, suggest missing relationships, and visualise sections of the ontology. The portal will support the publication of and access to regular new releases of CSO, with the aim of providing a comprehensive resource to the various communities engaged with scholarly data.
Linked Open Data: Combining Data for the Social Sciences and Humanities (and ... - Richard Zijdeman
A glimpse of how we are used to connecting datasets on our laptops and how, imho, we need to move to the Web of Data, including a demo connecting various sources, all from your(!) machine.
Extreme weather events pose great potential risks to ecosystems, infrastructure, and human health. Analyzing extreme weather in the observed record (satellite, reanalysis products) and characterizing changes in extremes in simulations of future climate regimes are important tasks. Thus far, extreme weather events have typically been specified by the community through hand-coded, multi-variate threshold conditions. Such criteria are usually subjective, and often there is little agreement in the community on the specific algorithm that should be used. We propose a different approach: machine learning, and in particular deep learning. If human experts can provide spatio-temporal patches of a climate dataset, and associated labels, we can turn to a machine learning system to learn the underlying feature representation. The trained machine learning (ML) system can then be applied to novel datasets, thereby automating the pattern detection step. Summary statistics, such as the location, intensity, and frequency of such events, can easily be computed as a post-process.
We will report compelling results from our investigations of Deep Learning for the tasks of classifying tropical cyclones, atmospheric rivers and weather front events. For all of these events, we observe 90-99% classification accuracy. We will also report on progress in localizing such events: namely drawing a bounding box (of the correct size and scale) around the weather pattern of interest. Both tasks currently utilize multi-layer convolutional networks in conjunction with hyper-parameter optimization. We utilize HPC systems at NERSC to perform the optimization across multiple nodes, and utilize highly-tuned libraries to utilize multiple cores on a single node. We will conclude with thoughts on the frontier of Deep Learning and the role of humans (vis-a-vis AI) in the scientific discovery process.
Presentation held by Lim Ying Sean, Arun Anand Sadanandan, Dickson Lukose and Klaus Tochtermann at the Agricultural Ontology Service (AOS) Workshop 2012 in Kuching, Sarawak, Malaysia, September 3-4, 2012
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner - Francesco Osborne
The process of classifying scholarly outputs is crucial to ensure timely access to knowledge. However, this process is typically carried out manually by expert editors, leading to high costs and slow throughput. In this paper we present Smart Topic Miner (STM), a novel solution which uses semantic web technologies to classify scholarly publications on the basis of a very large automatically generated ontology of research areas. STM was developed to support the Springer Nature Computer Science editorial team in classifying proceedings in the LNCS family. It analyses in real time a set of publications provided by an editor and produces a structured set of topics and a number of Springer Nature classification tags, which best characterise the given input. In this paper we present the architecture of the system and report on an evaluation study conducted with a team of Springer Nature editors. The results of the evaluation, which showed that STM classifies publications with a high degree of accuracy, are very encouraging and as a result we are currently discussing the required next steps to ensure large-scale deployment within the company.
ACT Talk, Giuseppe Totaro: High Performance Computing for Distributed Indexin... - Advanced-Concepts-Team
Searching for information within large sets of unstructured, heterogeneous scientific data can be very challenging unless an inverted index has been created in advance. Several solutions, mainly based on the Hadoop ecosystem, have been proposed to accelerate the process of index construction. These solutions perform well when data are already distributed across the cluster nodes involved in the elaboration. On the other hand, the cost of distributing data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution reduces to the bare minimum the number of I/O operations by using a stream of in-memory operations to extract and index heterogeneous data. We further improve the performance by using GPUs and POSIX Threads programming for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark.
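As a minimal, hypothetical sketch of the inverted-index structure such systems construct (not ISODAC's actual pipeline, which streams extraction and indexing in memory and offloads heavy stages to GPUs and POSIX threads):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Minimal in-memory inverted index: token -> sorted list of the
    ids of documents containing that token."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

def search(index, *terms):
    """Conjunctive query: ids of documents containing every term."""
    postings = [set(index.get(t.lower(), ())) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []
```

The index maps each token to its posting list, so a conjunctive query reduces to intersecting a few sorted lists instead of scanning the raw documents.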
Presentation slide for this:
Kei Kurakawa, Toward universal information access on the digital object cloud, In book of abstracts of International Workshop on Data Science - Present & Future of Open Data & Open Science -, p.57-59, November 12-15, 2018, Mishima Citizens Cultural Hall & Joint Support-Center for Data Science Research, Mishima, Shizuoka, Japan
International Workshop on Sharing, Citation and Publication of Scientific Data across Disciplines
Joint Support-Center for Data Science Research (DS), ROIS
NIPR / NINJAL, Tachikawa, Tokyo, Japan, 5-7 December 2017.
Analysis and Modeling of Complex Data in Behavioral and Social Sciences
Joint meeting of Japanese and Italian Classification Societies
Anacapri (Capri Island, Italy), 3-4 September 2012
OR2012, The 7th international conference on Open Repositories
09 - 13/Jul/2012, the University of Edinburgh, UK
RF3: Pecha Kucha – National Infrastructures, 11/Jul/2012: 11:00am – 12:30pm
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and can thus also reduce iteration time. Road networks often have chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
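A minimal sketch of the first optimization mentioned above, skipping computation on already-converged vertices, assuming a graph with no dangling nodes (names and structure are my own illustration, not the STICD implementation):

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank over an adjacency list {u: [v, ...]}.
    A vertex whose rank changes by less than tol is marked converged
    and skipped in later iterations, trading a little accuracy for
    less work per iteration. Assumes every vertex has out-links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    incoming = {u: [] for u in nodes}
    for u, outs in graph.items():
        for v in outs:
            incoming[v].append(u)
    outdeg = {u: len(graph[u]) for u in nodes}
    converged = set()
    for _ in range(max_iter):
        new = {}
        for v in nodes:
            if v in converged:
                new[v] = rank[v]  # skip work on converged vertices
                continue
            s = sum(rank[u] / outdeg[u] for u in incoming[v])
            new[v] = (1 - damping) / n + damping * s
            if abs(new[v] - rank[v]) < tol:
                converged.add(v)
        rank = new
        if len(converged) == n:
            break
    return rank
```

On a symmetric cycle every vertex converges in one step; on skewed graphs the hub vertices keep iterating while the periphery drops out early, which is where the per-iteration savings come from.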
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Adjusting primitives for graph : SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, often operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Adjusting primitives for graph : SHORT REPORT / NOTES
Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence
1. Application of a Novel Subject Classification Scheme for a Bibliographic Database Using a Data-Driven Correspondence
Kei Kurakawa, Yuan Sun
National Institute of Informatics, Japan
Satoko Ando
Clarivate Analytics (Japan) Co., Ltd.
This is the presentation slides for the workshop BigScholar 2019 in conjunction with CIKM 2019 (ACM International Conference
on Information and Knowledge Management) Nov 7, 2019, at CNCC, Beijing, China.
Citation: Kurakawa K, Sun Y and Ando S (2020) Application of a Novel Subject Classification Scheme for a Bibliographic Database
Using a Data-Driven Correspondence. Front. Big Data 2:48. doi: 10.3389/fdata.2019.00048
2. Overview
• Introduction
• Motivation
• Applying a new subject classification scheme for a subject-classified bibliographic database
• Our main contributions
• Related work
• Theoretical background
• Subject classification model of the bibliographic database based on set theory
• Main steps of our data-driven approach
• Case study
• Applying the Japanese grants KAKENHI subject classification scheme for the Web of
Science citation database
• Conclusions and future work
3. Motivation
• In assessing research activities with bibliometrics, analysts are accustomed to using the major citation database Web of Science, whose subject classification schemes, i.e. the WoS Subject Categories, ESI, and GIPP, are prepared for qualitative analysis.
• Analysts also need domestic subject classification schemes for their analysis, but these are not implemented on the database.
• Applying a new classification scheme to the database by hand is too labor-intensive and time-consuming.
• How can we apply a new classification scheme to the database efficiently and effectively?
4. Our main contributions
• We propose an approach to apply a novel subject classification scheme to a subject-classified database using a data-driven correspondence between the new scheme and the present one, a practice familiar from digital libraries.
• We give a fundamental analytical model of subject classification schemes based on set theory and describe compact topological space formation for a new subject classification scheme as a necessary condition.
• We demonstrate the effectiveness and efficiency of our approach on a practical bibliographic database.
5. Related work
• In the field of computer science,
• Information retrieval
• Data mining
• Digital libraries
• Automated text categorization
• Classification (supervised learning)
• Naïve Bayes classification
• Neural networks
• Support vector machines
• Clustering (unsupervised learning)
• K-means
• Expectation maximization (EM)
• Hierarchical agglomerative clustering
• Divisive clustering
• Matrix decompositions
• More problem-specific methods
• Multi-label classification / multi-label
learning, based on
• SVM
• Deep learning
• Ensemble classification.
• Extreme multi-label classification, based on
• Graph embedding
• Convolutional neural network (CNN)
• Attention model of neural networks
• Label hierarchy considered
• A method of mapping between different
classification schemes
• Importing cataloguing records using a
different classification scheme in digital
libraries
• Information integration on the Web
6. Theoretical background
• Subject classification model of the bibliographic database
• Compact topological space formation for a new subject classification
scheme
• Inducing a correspondence between two subject classification
schemes using a research project database
12. Given a finite cover
• Given a finite cover 𝔒(1) = {𝑂𝑖} of the set of articles 𝑆, the pair (𝑆, 𝔒(1)) forms a compact topological space.
13. Another set of categories
• For another set of categories 𝔒(2) = {𝑂𝑖}, we want to form the compact topological space (𝑆, 𝔒(2)).
14. If we have an external database such as …
• Research project database
[Diagram: projects in an external database 𝑇 are linked to articles in 𝑆′ ⊆ 𝑆; the pair (𝑇, 𝔒(2) on 𝑇) forms a compact topological space.]
15. If we have an external database such as …
• Research project database
[Diagram: as on slide 14, the cover of (𝑇, 𝔒(2) on 𝑇) is carried over the project–article links to the linked articles, giving (𝑆′, 𝔒(2) on 𝑆′).]
• We can define a compact topological space for the second set of categories.
16. Compact topological spaces for the two subject classification schemes
• 𝔒(1) = {𝑂𝑖(1)} gives the compact topological space (𝑆′, 𝔒(1) on 𝑆′).
• 𝔒(2) = {𝑂𝑖(2)} gives the compact topological space (𝑆′, 𝔒(2) on 𝑆′).
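The setting of slides 12–16 can be written out compactly in LaTeX. This is a sketch in the slides' notation; the explicit restriction of the covers to the linked subset S′ is our reading of the diagrams:

```latex
% Sketch of the model on slides 12--16 (notation follows the slides).
% S: articles; O_i^{(1)}, O_i^{(2)}: categories of the two schemes.
\[
  \mathfrak{O}^{(1)} = \{O_i^{(1)}\}, \qquad
  S \subseteq \bigcup_i O_i^{(1)}
  \;\Longrightarrow\;
  (S, \mathfrak{O}^{(1)}) \text{ is a compact topological space.}
\]
% The second scheme is observed only on the subset S' of S whose
% articles are linked to the external project database T, so the two
% schemes are compared on S':
\[
  (S', \mathfrak{O}^{(1)}_{S'})
  \quad\text{and}\quad
  (S', \mathfrak{O}^{(2)}_{S'}).
\]
```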
21. Main steps of our approach
1. Inducing a correspondence between the two subject classification schemes by using the 𝐹𝛽-measure
1′-1. Constructing a contingency table between the two subject classification schemes
1′-2. Inducing a correspondence between the two subject classification schemes by using the pseudo 𝐹𝛽-measure
2. Revising the correspondence to guarantee the existence of a finite cover of the novel subject classification scheme
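Steps 1′-1/1′-2 can be sketched in Python. This is a minimal illustration, not the authors' implementation: the contingency table is assumed to be a dict of co-occurrence counts, and the greedy selection with its stop-when-Fβ-stops-improving rule is a simplification.

```python
# Hypothetical sketch: induce a correspondence from a contingency table.
# table[kaken][wos] = number of articles carrying both labels.
# For each KAKENHI category, WoS categories are added greedily (largest
# overlap first) until the pseudo F-beta measure stops improving.

def pseudo_f(beta, tp, n_kaken, n_wos):
    """Pseudo precision = tp / n_wos, pseudo recall = tp / n_kaken."""
    if tp == 0:
        return 0.0
    p = tp / n_wos
    r = tp / n_kaken
    return (1 + beta**2) * p * r / (beta**2 * p + r)

def induce_correspondence(table, beta=1.0):
    """Return a mapping {kakenhi_category: [wos_category, ...]}."""
    # Column totals: articles per WoS category over all KAKENHI rows.
    col = {}
    for row in table.values():
        for wos, c in row.items():
            col[wos] = col.get(wos, 0) + c
    mapping = {}
    for kaken, row in table.items():
        n_kaken = sum(row.values())  # articles in this KAKENHI category
        ranked = sorted(row, key=row.get, reverse=True)
        chosen, tp, n_wos, best = [], 0, 0, 0.0
        for wos in ranked:
            f = pseudo_f(beta, tp + row[wos], n_kaken, n_wos + col[wos])
            if f <= best:
                break  # adding more WoS categories no longer helps
            chosen.append(wos)
            tp += row[wos]
            n_wos += col[wos]
            best = f
        mapping[kaken] = chosen
    return mapping
```

With β = 1 this reduces to the pseudo F1-measure reported on slide 31; β can be tuned to weight pseudo recall over pseudo precision.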
22. Case study
• InCites™ (Clarivate Analytics)
• A world class research evaluation platform
• Web of Science™ citation database
• Web of Science classification scheme (251 categories)
• Essential Science Indicators (ESI) classification scheme (22 categories)
• Japanese users are eager to utilize the subject
classification scheme of Japan’s largest national
research grants KAKENHI.
• KAKEN (NII) research project database
• Archival records of research projects and the
outputs of KAKENHI grants in Japan.
• KAKENHI subject classification scheme (hierarchical
classification scheme; 4 categories, 10 areas, 67
disciplines, and 284 research fields)
25. Developing a contingency table as evidence
data
• We identified the same bibliographic records in the WoS citation database as of 2009 and
2010 through a set of record linkage techniques to obtain a set of articles 𝑆′ that are
classified using both the KAKENHI and WoS classification schemes.
[Diagram: bibliographic linkage between KAKEN (projects linked to articles 𝑎1, 𝑎2, 𝑎3) and Web of Science (articles 𝑎1′, 𝑎2′, 𝑎3′), identifying 𝑎𝑖′ ≡ 𝑎𝑖 so that the articles in 𝑆′ carry both classification schemes.]
26. A contingency table between WoS and
KAKENHI subject classification schemes
[Table: a part of the 251 WoS categories × 67 KAKENHI areas]
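Step 1′-1, counting the co-occurrences of KAKENHI and WoS labels over the dually classified articles in 𝑆′, can be sketched as follows; the record format and function name are hypothetical:

```python
# Illustrative sketch: build a contingency table from articles that carry
# both a KAKENHI label (via linked projects) and a WoS subject category.
from collections import defaultdict

def build_contingency(articles):
    """articles: iterable of (kakenhi_labels, wos_labels) pairs, one per
    article in S'. Each co-occurring label pair adds one count."""
    table = defaultdict(lambda: defaultdict(int))
    for kaken_labels, wos_labels in articles:
        for k in kaken_labels:
            for w in wos_labels:
                table[k][w] += 1
    # Convert to plain dicts for easier downstream use.
    return {k: dict(v) for k, v in table.items()}
```

Multi-labeled articles contribute one count per label pair, which is why the sum of the table's frequency counts (97,175, slide 38) can exceed the number of linked articles.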
27. Analysis of the contingency table
The discrete generalized beta distribution (DGBD):
𝑓(𝑟) = 𝐴 (𝑁 + 1 − 𝑟)ᵇ / 𝑟ᵃ
where 𝑟 is the rank value, 𝑁 its maximum value, 𝐴 a normalization constant, and 𝑎 and 𝑏 two fitting components.
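The DGBD is easy to evaluate directly; the sketch below uses the standard functional form, with parameter values chosen purely for illustration (they are not fits to the paper's data):

```python
# Minimal sketch of the discrete generalized beta distribution (DGBD):
#   f(r) = A * (N + 1 - r)**b / r**a
# r: rank (1 <= r <= N), N: maximum rank, A: normalization constant,
# a, b: fitting exponents.

def dgbd(r, N, A, a, b):
    """Value of the DGBD at rank r."""
    return A * (N + 1 - r) ** b / r ** a

# With positive exponents the curve decays with rank, which is the shape
# typically fitted to rank-ordered contingency counts.
values = [dgbd(r, N=10, A=1.0, a=0.8, b=0.3) for r in range(1, 11)]
```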
31. The third-level 67 disciplines

Averages: cardinality 1450.4; no. of WoS subject categories 6.1; pseudo precision 0.315; pseudo recall 0.367; pseudo F1-measure 0.317

| Seq. no. | KAKENHI subject category | Translation | Cardinality | No. of WoS subject categories to get the max pseudo F1-measure | Pseudo precision | Pseudo recall | Max pseudo F1-measure |
|---|---|---|---|---|---|---|---|
| (l3-46) | 材料工学 | Material Engineering | 2931 | 6 | 0.348 | 0.523 | 0.418 |
| (l3-47) | プロセス工学 | Process/Chemical Engineering | 1283 | 4 | 0.145 | 0.306 | 0.197 |
| (l3-48) | 総合工学 | Integrated Engineering | 1465 | 8 | 0.256 | 0.309 | 0.280 |
| (l3-49) | 基礎生物学 | Basic Biology | 2423 | 7 | 0.375 | 0.400 | 0.387 |
| (l3-50) | 生物科学 | Biological Science | 2679 | 4 | 0.167 | 0.582 | 0.259 |
| (l3-51) | 人類学 | Anthropology | 300 | 3 | 0.315 | 0.440 | 0.367 |
| (l3-52) | 農学 | Plant Production and Environmental Agriculture | 899 | 4 | 0.307 | 0.449 | 0.365 |
| (l3-53) | 農芸化学 | Agricultural Chemistry | 1755 | 6 | 0.220 | 0.386 | 0.281 |
| (l3-54) | 林学 | Forest and Forest Products Science | 559 | 5 | 0.408 | 0.252 | 0.312 |
| (l3-55) | 水産学 | Applied Aquatic Science | 581 | 2 | 0.419 | 0.327 | 0.367 |
| (l3-56) | 農業経済学 | Agricultural Science in Society and Economy | 31 | 2 | 0.333 | 0.097 | 0.150 |
| (l3-57) | 農業工学 | Agro-Engineering | 216 | 4 | 0.157 | 0.259 | 0.195 |
| (l3-58) | 畜産学・獣医学 | Animal Life Science | 1190 | 4 | 0.511 | 0.387 | 0.440 |
| (l3-59) | 境界農学 | Boundary Agriculture | 541 | 4 | 0.235 | 0.148 | 0.181 |
| (l3-60) | 薬学 | Pharmacy | 3457 | 4 | 0.294 | 0.369 | 0.328 |
| (l3-61) | 基礎医学 | Basic Medicine | 5232 | 16 | 0.213 | 0.551 | 0.307 |
| (l3-62) | 境界医学 | Boundary Medicine | 850 | 12 | 0.162 | 0.112 | 0.132 |
| (l3-63) | 社会医学 | Society Medicine | 1065 | 8 | 0.282 | 0.262 | 0.271 |
32. Miscellaneous considerations
• Decision by an expert
• Limit the number of correspondences to 1–4 for each 𝑂𝑖(1).
• For every Web of Science subject category 𝑂𝑖(1), the number of relations with KAKENHI subject categories 𝑂𝑗(2) is limited to 4 at most.
• For every Web of Science subject category 𝑂𝑖(1), when the recall rate exceeds one half, we stop adding any more relations.
• Check all correspondences between 𝑂𝑖(1) and 𝑂𝑗(2).
• Add or remove correspondence relations between them by means of subject classification keywords.
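The quantitative part of these expert-revision rules can be expressed as a small filter. The thresholds (at most 4 relations, stop once recall exceeds one half) follow the slide, while the data format and names are hypothetical; the keyword-based manual checks are of course not automated here:

```python
# Hedged sketch of the revision rules for one WoS subject category.
# candidates: list of (kakenhi_category, recall_gain) pairs, assumed
# sorted by decreasing relevance; recall_gain is the recall contributed
# by adding that relation.

def revise(candidates, max_relations=4, recall_cap=0.5):
    """Keep at most max_relations candidates, stopping early once the
    accumulated recall exceeds recall_cap."""
    kept, recall = [], 0.0
    for kaken, gain in candidates:
        if len(kept) >= max_relations or recall > recall_cap:
            break
        kept.append(kaken)
        recall += gain
    return kept
```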
34. Example screen of InCites™ (a snapshot of 2018-12-14)
• WoS Documents: 58,395,008 for Web of Science subject categories
• WoS Documents: 3,192,449 for Web of Science subject categories limited with “LOCATION = JAPAN”
• WoS Documents: 3,191,448 for KAKEN L3 subject categories limited with “LOCATION = JAPAN”
The bubbles represent proportional numbers of articles classified using the KAKENHI subject categories.
35. Top 30 subject distribution of Japanese authors’ articles with the two subject classification schemes
[Two charts: WoS subject classification scheme; KAKENHI subject classification scheme]
36. User feedback: Questions and answers on the validity of the KAKENHI subject classification scheme
• KAKEN classification scheme
• April 2016, released on InCites Benchmarking
• User survey
• March 2017, by online questionnaire to institutional active users
• 18 questions
• Results
• Feedback from 26 institutional users
• Q7: Which levels of hierarchy in the KAKENHI subject classification scheme do you need?
• Q11: Do you feel comfortable with your analysis results by the KAKENHI subject classification scheme, in accordance with your experience?

| User role in the institution | Yes (multiple answers possible) |
|---|---|
| RA (research administrator) | 20 |
| Administrator / officer | 3 |
| IR (institutional research) staff | 5 |
| Others | 2 |

Other: 1, “I need more detailed categories”
37. Discussion (1)
• Our approach, i.e. deciding a correspondence between two subject classification schemes, has an inherent limitation.
• In natural correlations between subject categories of two classification schemes, each subject category of one scheme partly overlaps several subject categories of the other scheme.
• There is no inclusion relationship between them.
• Correspondence relations are probabilistic.
• Research projects and journal articles have similarities and differences in subject.
• Projects and articles have a strong correlation in subject.
• But they also differ in subject.
• Projects precede articles.
• Projects tend to indicate the central concept with essential keywords.
38. Discussion (2)
• Nevertheless, the classification results were accepted by InCites users.
• Our approach requires less workload.
• The number of journal titles in the Web of Science citation database is 24,688.
• The number of Web of Science documents in InCites is 58,395,008.
• The number of subject category pairs for which a correspondence must be decided is far smaller:
• For KAKEN 67 – WoS 251, the number of pairs is 16,817.
• For KAKEN 10 – WoS 251, the number of pairs is 2,510.
• But the evidence data is not sufficient for automatic decision making.
• The sum of the frequency counts of the contingency table is 97,175.
• Manual handling was needed.
39. Conclusions and future work
• Conclusions
• We proposed an approach to apply a new subject classification scheme to a bibliographic database that is already classified with another subject classification scheme.
• We gave a fundamental analytical model of subject classification schemes based on set theory.
• Compact topological space formation for a new subject classification scheme is a necessary condition.
• An external database, e.g. a research project database, is utilized to induce a correspondence between the two subject classification schemes.
• We applied the approach to a practical example, InCites™, a research evaluation tool based on the Web of Science citation database, to add the subject classification scheme of Japan’s largest national grants, KAKENHI. The user survey indicates that users generally accept the new function.
• Future work
• For a complex classification scheme, such as a hierarchical one, our approach should be extended to account for its structure.
• Alternatively, multi-label learning is another possible way to reach our goal. We need to compare it with our method.
40. Acknowledgments
• This presentation is a result of a joint research between National
Institute of Informatics and Clarivate Analytics, Co., Ltd. As for the
databases we used in this presentation, the KAKEN database is
provided by National Institute of Informatics, Cyber Science
Infrastructure Development Department, Scholarly and Academic
Information Division, and the Web of Science citation database is
provided by Clarivate Analytics, Co., Ltd. We are thankful to the
organizations who let us use the valuable assets.
Editor's Notes
intersection, set difference, union,
Another set of categories is unknown for the set S.
We want to specify the compact topological space for S.
As a data-driven approach, …
Given a research project database, we can observe compact topological spaces for the two subject classification schemes.
Strategic position to induce a correspondence is to maximize the F-measure.
The pseudo metrics are different from the original metrics because of subadditivity.