Most work on scholarly document processing assumes that the information processed is trustworthy and factually correct. However, this is not always the case. There are two core challenges that should be addressed: 1) ensuring that scientific publications are credible -- e.g. that claims are not made without supporting evidence, and that all relevant supporting evidence is provided; and 2) ensuring that scientific findings are not misrepresented, distorted or outright misreported when communicated by journalists or the general public. I will present some first steps towards addressing these problems and outline remaining challenges.
SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications – Isabelle Augenstein
Shared task summary for SemEval 2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific Publications
Paper: https://arxiv.org/abs/1704.02853
Abstract:
We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.
Machine Reading Using Neural Machines (talk at Microsoft Research Faculty Summit) – Isabelle Augenstein
The document discusses machine reading using neural machines. It presents the goals of fact checking claims and understanding scientific publications. It outlines challenges in tasks like stance detection on tweets and summarizing scientific papers, including interpreting statements relative to the target or headline, handling unseen targets, the small size of benchmark datasets, and the computational cost of neural machine reading.
Semantic annotation is performed by first representing words and documents in the vector space model using Word2Vec and Doc2Vec implementations; the vectors are then used as features to train a classifier that can label a document with ACM classification tree categories, with the help of a Wikipedia corpus (a toy sketch follows the references below).
Project Presentation: https://youtu.be/706HJteh1xc
Project Webpage: http://rohitsakala.github.io/semanticAnnotationAcmCategories/
Source Code: https://github.com/rohitsakala/semanticAnnotationAcmCategories
References:
Quoc V. Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents", ICML 2014.
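As a rough illustration of the pipeline this entry describes, here is a minimal sketch using gensim's Doc2Vec and a scikit-learn classifier. The corpus, labels, and classifier choice are invented for illustration and are not taken from the project.

```python
# Minimal sketch: Doc2Vec document vectors as features for a category
# classifier (gensim 4.x API; toy data stands in for the Wikipedia corpus).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

docs = [("neural networks learn representations", "I.2"),
        ("sorting algorithms and data structures", "E.1"),
        ("database query optimisation techniques", "H.2")]

tagged = [TaggedDocument(words=text.split(), tags=[str(i)])
          for i, (text, _) in enumerate(docs)]
model = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=40)

X = [model.dv[str(i)] for i in range(len(docs))]  # learned document vectors
y = [label for _, label in docs]                  # ACM-style category labels

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([model.infer_vector("graph databases and queries".split())]))
```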
Transfer Learning -- The Next Frontier for Machine Learning – Sebastian Ruder
Sebastian Ruder gave a presentation on transfer learning in machine learning. He began by defining transfer learning as applying knowledge gained from solving one problem to a different but related problem. Transfer learning is now important because machine learning models have matured and are being widely deployed, but often lack labeled data for new tasks or domains. Ruder discussed examples of transfer learning in computer vision and natural language processing. He described his research focus on finding better ways to transfer knowledge between domains, tasks, and languages in large-scale, real-world applications.
This document presents a method for clustering citation distributions of authors to categorize them semantically and predict future citations. It uses hierarchical clustering with normalized Euclidean distance on citation distributions. Clusters are evaluated based on homogeneity of citation patterns over time. Semantic features of author bibliometric data are represented using the BiDO ontology to link numeric and categorical data over time. The method was evaluated on a dataset of 20,000 computer scientists from 1990-2010. Future work involves augmenting features, applying it to groups, extending the ontology, and creating a linked bibliometric triplestore.
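A minimal sketch of this style of clustering with SciPy follows; the toy citation counts, normalisation, and linkage settings are assumptions, not the paper's exact configuration.

```python
# Hierarchical clustering of per-year citation distributions with
# normalised Euclidean distances (toy data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

citations = np.array([
    [ 1,  3,  8, 15, 20],   # author with a rising citation profile
    [ 2,  5, 10, 18, 25],   # another rising profile
    [30, 20, 10,  5,  2],   # declining profile
])

# Normalise each row so the shape of the distribution, not its volume,
# drives the distance computation.
norm = citations / citations.sum(axis=1, keepdims=True)

Z = linkage(norm, method="average", metric="euclidean")
print(fcluster(Z, t=2, criterion="maxclust"))  # rising authors cluster together
```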
The document is a thesis proposal by Justin Sybrandt at Clemson University that outlines his past and proposed work on exploiting latent features in text and graphs. It summarizes Sybrandt's peer-reviewed work using embeddings to generate biomedical hypotheses from text and validate hypotheses through ranking. It also discusses pending work on heterogeneous bipartite graph embeddings and partitioned hypergraphs. The proposal provides background on Sybrandt's hypothesis generation work and outlines his proposed future research directions involving graph embeddings.
IRJET-Classifying Mined Online Discussion Data for Reflective Thinking based on Ontology – IRJET Journal
This document presents a methodology for classifying mined online discussion data to identify reflective thinking based on ontology. It involves the following steps (a toy sketch of the preprocessing and Naive Bayes steps follows the list):
1. Collecting online discussion data and preprocessing it by removing stop words and punctuation.
2. Implementing inductive content analysis to categorize the data into six types of reflective thinking.
3. Training a Naive Bayes classifier on the categorized data to classify new data.
4. Applying the trained model to large scale unlabeled online discussion data.
5. Using ontology to provide a deeper classification of topics in the data beyond the six reflective thinking categories. This allows extraction of additional knowledge from the classified text data.
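A toy sketch of the preprocessing and Naive Bayes steps with scikit-learn; the category names and example posts are invented, not the paper's data.

```python
# Steps 1 and 3 in miniature: stop-word removal via CountVectorizer,
# then a Naive Bayes classifier over the categorised posts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

posts = ["I wonder whether my reasoning here was sound",
         "the deadline for the assignment is Friday",
         "looking back, I would approach the task differently"]
labels = ["self-questioning", "non-reflective", "re-evaluation"]  # invented names

clf = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
clf.fit(posts, labels)
print(clf.predict(["next time I will plan my discussion posts earlier"]))
```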
Transfer learning aims to improve learning outcomes for a target task by leveraging knowledge from a related source task. It does this by influencing the target task's assumptions based on what was learned from the source task. This can allow for faster and better generalized learning in the target task. However, there is a risk of negative transfer where performance decreases. To avoid this, methods examine task similarity and reject harmful source knowledge, or generate multiple mappings between source and target to identify the best match. The goal of transfer learning is to start higher, learn faster, and achieve better overall performance compared to learning the target task without transfer.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms – Johann Petrak
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
Evaluating Machine Learning Algorithms for Materials Science using the Matbench Protocol – Anubhav Jain
1) The document discusses evaluating machine learning algorithms for materials science using the Matbench protocol.
2) Matbench provides standardized datasets, testing procedures, and an online leaderboard to benchmark and compare machine learning performance.
3) This allows different groups to evaluate algorithms independently and identify best practices for materials science predictions.
This document describes a graph-based approach for skill extraction from text. It discusses expertise retrieval and previous related work. It then describes the Elisit system, which uses Wikipedia pages and spreading activation on Wikipedia's hyperlink network to associate skills with input documents. Sample queries to the system are provided. The method associates documents with Wikipedia pages and then performs spreading activation to find related skills. Evaluation shows biasing the spreading activation improves results.
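A toy sketch of generic spreading activation over a hyperlink graph is shown below; the graph, decay factor, and stopping rule are invented, and Elisit's biasing scheme is not reproduced.

```python
# Generic spreading activation: seed pages matched in a document pass
# attenuated activation to their hyperlink neighbours.
from collections import defaultdict

links = {  # page -> outgoing hyperlinks (toy Wikipedia-like graph)
    "Python (programming language)": ["Machine learning", "Software engineering"],
    "Machine learning": ["Statistics", "Data mining"],
    "Data mining": ["Statistics"],
}

def spread(seeds, decay=0.5, iterations=2):
    activation = defaultdict(float)
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        new = defaultdict(float, activation)
        for page, score in activation.items():
            for target in links.get(page, []):
                new[target] += decay * score  # each hop attenuates the signal
        activation = new
    return sorted(activation.items(), key=lambda kv: -kv[1])

# High-activation neighbours of the document's pages become candidate skills.
print(spread(["Python (programming language)"]))
```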
ICML 2018 included papers on generative models, music and audio applications, and AI security. On generative models, papers explored topics like learning many-to-many mappings between domains, joint distribution learning, and reducing amortization gaps in VAEs. In music, works examined hierarchical latent space models for music structure and style transfer for speech synthesis. Regarding security, studies analyzed adversarial attacks across domains, the threat of adversarial examples, and circumventing defenses through obfuscated gradients.
This document summarizes work done by the Julia Lab at MIT on genomics data analysis and optimization of principal component analysis algorithms for genome-wide association studies. It describes how a native Julia implementation of PCA reduced the computation time for finding the top 10 principal components of an 80,000x40,000 genotype matrix from over 2,900 seconds to just 81 seconds. It also discusses how custom matrix-vector multiplication functions achieved the same computation speed while using 32x less memory by reading directly from a compressed data format. Future work directions include more complex analytics, improved data imputation methods, out-of-core matrix operations, and accessing different data formats.
The ontology engineering research community has focused for many years on supporting the creation, development and evolution of ontologies. Ontology forecasting, which aims at predicting semantic changes in an ontology, represents instead a new challenge. In this paper, we aim to contribute to this novel endeavour by focusing on the task of forecasting semantic concepts in the research domain. Indeed, ontologies representing scientific disciplines contain only research topics that are already popular enough to be selected by human experts or automatic algorithms. They are thus unfit to support tasks which require the ability to describe and explore the forefront of research, such as trend detection and horizon scanning. We address this issue by introducing the Semantic Innovation Forecast (SIF) model, which predicts new concepts of an ontology at time t+1, using only data available at time t. Our approach relies on lexical innovation and adoption information extracted from historical data. We evaluated the SIF model on a very large dataset consisting of over one million scientific papers belonging to the Computer Science domain: the outcomes show that the proposed approach offers a competitive boost in mean average precision-at-ten compared to the baselines when forecasting over 5 years.
Question Answering System using machine learning approach – Garima Nanda
In compact form, this presentation shows how a machine learning approach based on classification techniques can be used for effective and efficient question answering.
Graph Centric Analysis of Road Network Patterns for CBD's of Metropolitan Cities – Punit Sharnagat
OSMnx is a Python package to retrieve, model, analyze, and visualize street networks from OpenStreetMap.
OpenStreetMap (OSM) is a collaborative mapping project that provides a free and publicly editable map of the world.
OpenStreetMap provides a valuable crowd-sourced database of raw geospatial data for constructing models of urban street networks for scientific analysis.
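A typical OSMnx call sequence, as a sketch; the place name is arbitrary, and since the OSMnx API has shifted across versions, the exact signatures should be checked against the current documentation.

```python
# Download a drivable street network from OpenStreetMap and summarise it.
import osmnx as ox

G = ox.graph_from_place("Connaught Place, New Delhi, India",
                        network_type="drive")
stats = ox.basic_stats(G)      # ox.stats.basic_stats in newer releases
print(stats["n"], stats["m"])  # node and edge counts of the network
ox.plot_graph(G)               # quick visual check
```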
A Julia package for iterative SVDs with applications to genomics data analysis – Jiahao Chen
This document discusses a Julia package for iterative singular value decompositions (SVDs) with applications to genomics data analysis. It introduces SVD and its use in genome-wide association studies to identify genetic factors associated with diseases and traits. It summarizes different iterative SVD methods like Lanczos iteration and challenges like loss of orthogonality. The document presents a new Julia package called FlashPCA that uses a blocked power method to quickly approximate the top SVD components, and compares its performance to other iterative SVD solvers on large genomic datasets.
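The blocked power method can be sketched in a few lines of NumPy; this is a generic subspace iteration on a toy matrix, not FlashPCA's implementation or the Julia code.

```python
# Approximate the top-k SVD by blocked (subspace) power iteration,
# re-orthonormalising each step to avoid loss of orthogonality.
import numpy as np

def blocked_power_svd(A, k=10, iters=30, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.linalg.qr(rng.standard_normal((A.shape[1], k)))[0]
    for _ in range(iters):
        Q = np.linalg.qr(A.T @ (A @ Q))[0]   # one blocked power step + QR
    B = A @ Q                                # project A onto the subspace
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return U, s, Vt @ Q.T

rng = np.random.default_rng(1)
A = rng.standard_normal((500, 60)) @ rng.standard_normal((60, 200))  # toy matrix
_, s, _ = blocked_power_svd(A, k=10)
print(np.round(s[:5], 2))
print(np.round(np.linalg.svd(A, compute_uv=False)[:5], 2))  # reference values
```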
IRJET- Finding the Original Writer of an Anonymous Text using Naïve Bayes Classifier – IRJET Journal
This document discusses using a Naive Bayes classifier to identify the original writer of an anonymous text. It begins with an introduction to the problem and an overview of Naive Bayes classification. It then describes how a text can be represented as a bag-of-words with word frequencies and explains how Naive Bayes can be used to calculate the posterior probability and predict the class. The document compares the performance of Naive Bayes to other algorithms using an email dataset, finding that Naive Bayes has the highest performance factor due to its fast training and prediction times relative to its accuracy. It concludes that Naive Bayes can accurately predict the original writer of anonymous texts.
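For reference, the posterior described above can be written, under the bag-of-words assumption, as:

```latex
% Posterior over candidate writers c for a document d with word counts f_w:
P(c \mid d) \;\propto\; P(c) \prod_{w \in d} P(w \mid c)^{f_w},
\qquad \hat{c} = \arg\max_{c} P(c \mid d)
```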
The Status of ML Algorithms for Structure-property Relationships Using Matbench – Anubhav Jain
The document discusses the development of Matbench, a standardized benchmark for evaluating machine learning algorithms for materials property prediction. Matbench includes 13 standardized datasets covering a variety of materials prediction tasks. It employs a nested cross-validation procedure to evaluate algorithms and ranks submissions on an online leaderboard. This allows for reproducible evaluation and comparison of different algorithms. Matbench has provided insights into which algorithm types work best for certain prediction problems and has helped measure overall progress in the field. Future work aims to expand Matbench with more diverse datasets and evaluation procedures to better represent real-world materials design challenges.
This document provides an overview of the CSE 591: Machine Learning and Applications course taught by Dr. Jieping Ye at Arizona State University. The following key points are discussed:
- Course information including instructor, time/location, prerequisites, and objectives: to provide an understanding of machine learning methods and applications.
- Topics covered include clustering, classification, dimensionality reduction, semi-supervised learning, and kernel learning.
- The grading breakdown includes homework, a group project, and an exam. Students are required to participate in class discussions.
- An introduction to machine learning is provided including definitions of supervised vs. unsupervised learning and applications in domains like bioinformatics.
Extracting and Making Use of Materials Data from Millions of Journal Articles – Anubhav Jain
- The document discusses using natural language processing techniques to extract materials data from millions of journal articles.
- It aims to organize the world's information on materials science by using NLP models to extract useful data from unstructured text sources like research literature in an automated manner.
- The process involves collecting raw text data, developing machine learning models to extract entities and relationships, and building search interfaces to make the extracted data accessible.
Integrating natural language processing and software engineering – Nakul Sharma
This document summarizes research on integrating natural language processing and software engineering. It provides a literature review of works that have used natural language text as input to generate software engineering artifacts like UML diagrams, test cases, and process models. The paper also discusses how techniques from natural language processing can be applied to different phases of the software development life cycle and how natural language understanding can help automate software engineering tasks.
Resource Allocation Using Metaheuristic Search – csandit
This document discusses using metaheuristic search techniques to solve resource allocation and scheduling problems that are common in software development projects. It evaluates the performance of three algorithms - simulated annealing, tabu search, and genetic algorithms - on test problems representative of resource constrained project scheduling problems (RCPSP). The experimental results found that all three metaheuristics can solve such problems effectively, with genetic algorithms performing slightly better overall than the other two techniques.
This document proposes using Word2Vec and decision trees to extract keywords from textual documents and classify the documents. It reviews related work on keyword extraction and text classification techniques. The proposed approach involves preprocessing text, representing words as vectors with Word2Vec, calculating frequently occurring keywords for each category, and using decision trees to classify documents based on keyword similarity. Experiments using different preprocessing and Word2Vec settings achieved an F-score of up to 82% for document classification.
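A rough sketch of such a pipeline follows; averaging word vectors into a document representation is a simplifying assumption standing in for the paper's keyword-similarity step.

```python
# Word2Vec vectors feeding a decision tree classifier (toy corpus).
import numpy as np
from gensim.models import Word2Vec
from sklearn.tree import DecisionTreeClassifier

docs = [("machine learning model training data", "ML"),
        ("football match goal score league", "Sport"),
        ("neural network training accuracy", "ML"),
        ("tennis player tournament final", "Sport")]

sentences = [text.split() for text, _ in docs]
w2v = Word2Vec(sentences, vector_size=32, min_count=1, epochs=50)

def doc_vector(tokens):
    # Average the word vectors of known tokens as a document feature.
    return np.mean([w2v.wv[t] for t in tokens if t in w2v.wv], axis=0)

X = np.array([doc_vector(s) for s in sentences])
y = [label for _, label in docs]
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([doc_vector("league final score".split())]))
```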
Automatic subject prediction is a desirable feature for modern digital library systems, as manual indexing can no longer cope with the rapid growth of digital collections. This is an "extreme multi-label classification" problem, where the objective is to assign a small subset of the most relevant subjects from an extremely large label set. Data sparsity and model scalability are the major challenges we need to address to solve it automatically. In this paper, we describe an efficient and effective embedding method that embeds terms, subjects and documents into the same semantic space, where similarity can be computed easily. We then propose a novel Non-Parametric Subject Prediction (NPSP) method and show how effectively it predicts even very specialised subjects, which are associated with few documents in the training set and are not predicted by state-of-the-art classifiers.
The document proposes a framework for mining product reputations from online opinions. It extracts opinions from web pages and labels them with positive or negative opinion likelihood. Reputation is analyzed using rule analysis to extract characteristic words, co-occurrence analysis, typical sentence analysis, and correspondence analysis to map relationships. Experiments analyzing opinions of cell phones, PDAs, and ISPs showed the framework in action. The framework combines opinion extraction with text mining to automatically gather and analyze large volumes of online opinions.
Hannah Peeler is a senior at the University of Texas at Austin majoring in electrical engineering with an overall GPA of 3.87. She has experience with embedded systems and circuitry design from academic projects. Peeler has held leadership roles with IEEE and the Women in Engineering Program where she has organized events and mentored other students. She also participated in a leadership development program and is involved with other engineering organizations on campus. Peeler will graduate in May 2018 with honors and is eligible to work in the US without restrictions.
This summarizes an academic paper that proposes an automatic ontology creation method for classifying research papers. It uses text mining techniques like classification and clustering algorithms. It first builds a research ontology by extracting keywords and patterns from previous papers. It then uses a decision tree algorithm to classify new papers into disciplines defined in the ontology. The classified papers are then clustered based on similarities to group them. The method was tested on a dataset of 100 papers and achieved average precision of 85.7% for term-based and 89.3% for pattern-based keyword extraction.
Exploiting Wikipedia and Twitter for Text Mining Applications – IRJET Journal
This document discusses exploiting Wikipedia and Twitter for text mining applications. It explores using Wikipedia's category-article structure for text classification, subjectivity analysis, and keyword extraction. It evaluates classifying tweets as relevant/irrelevant to entities or brands and classifying tweets into topical dimensions like workplace or innovation. Features used include relatedness scores between tweet text and Wikipedia categories, topic modeling scores, and Twitter-specific features. Experimental results show the Wikipedia framework based on its category-article structure outperforms standard text mining techniques.
Automatic Grading of Handwritten Answers – IRJET Journal
The document describes a proposed system for automatically grading handwritten exam answers. It uses optical character recognition to extract text from scanned answer sheets. Natural language processing techniques like BERT embeddings and Word Mover's Distance are used to compare the student's answer against reference answers from teachers. The system aims to grade papers quickly and accurately to reduce workload for teachers while still providing detailed performance assessments for students. It was developed to address the need for online and at-scale exam grading given limitations of traditional in-person exams.
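The Word Mover's Distance comparison step can be sketched with gensim and pretrained vectors; the OCR and BERT components are omitted, and the example answers are invented.

```python
# Compare a student answer against a reference answer via Word Mover's
# Distance (needs gensim plus its optimal-transport dependency, e.g. POT).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained vectors

reference = "photosynthesis converts light energy into chemical energy".split()
student   = "plants turn sunlight into chemical energy".split()
off_topic = "the mitochondria is the powerhouse of the cell".split()

# Lower distance = closer to the reference answer, hence a higher grade.
print(vectors.wmdistance(reference, student))
print(vectors.wmdistance(reference, off_topic))
```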
Text Segmentation for Online Subjective Examination using Machine Learning – IRJET Journal
This document discusses using k-Nearest Neighbor (K-NN) machine learning for text segmentation of online exams. K-NN is an instance-based learning method that computes similarity between feature vectors to determine how similar two texts are. The goal is to implement natural language processing using text segmentation, which benefits automated evaluation. It reviews related work applying machine learning methods such as K-NN, support vector machines, and decision trees to tasks like text categorization and clustering.
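A minimal K-NN text-similarity sketch with scikit-learn; TF-IDF features and cosine distance are assumptions, since the summary does not specify the feature vectors used.

```python
# Instance-based classification: new text is labelled like its nearest
# neighbour in TF-IDF space.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

answers = ["the process splits text into coherent segments",
           "completely unrelated response about the weather",
           "segmentation divides the answer into topical units"]
labels = ["relevant", "irrelevant", "relevant"]

knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=1, metric="cosine"))
knn.fit(answers, labels)
print(knn.predict(["text is divided into segments by topic"]))
```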
The document describes a project to semantically annotate research papers with ACM classification categories. It discusses using cosine similarity, latent Dirichlet allocation, and a proposed model combining labeled LDA and doc2vec. The proposed model trains a supervised topic model to learn document representations that capture semantic relationships between papers and categories. The model achieved 59.31% mean average precision and 45.03% NDCG on a test dataset, demonstrating an improvement over baselines.
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System – IRJET Journal
This document proposes a knowledge graph and question answering system to extract and analyze information from large volumes of unstructured data like annual reports. It discusses using natural language processing techniques like named entity recognition with spaCy and dependency parsing to extract entity-relation pairs from text and construct a knowledge graph. For question answering, it analyzes user queries with similar NLP approaches and then matches query triplets to the knowledge graph to retrieve answers, combining information retrieval and trained classifiers. The proposed system aims to provide faster understanding and analysis of complex, unstructured data for professionals.
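A simplified sketch of the extraction step with spaCy; this subject-verb-object heuristic is an assumption for illustration, not the paper's exact method.

```python
# Named entities plus naive (subject, verb-lemma, object) triples for a
# knowledge graph (requires the en_core_web_sm model to be installed).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp acquired Beta Ltd in 2021. Revenue grew strongly.")

print([(ent.text, ent.label_) for ent in doc.ents])  # entity candidates

for token in doc:
    if token.pos_ == "VERB":
        subj = [w for w in token.lefts if w.dep_ in ("nsubj", "nsubjpass")]
        obj = [w for w in token.rights if w.dep_ in ("dobj", "obj")]
        if subj and obj:
            # Head tokens only; a fuller system would expand to entity spans.
            print((subj[0].text, token.lemma_, obj[0].text))
```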
Ukrainian Catholic University, Faculty of Applied Sciences, Data Science Master Program, January 22nd
Abstract. Generative adversarial networks (GANs) are one of the most popular models capable of producing high-quality images. However, most works generate images from a vector of random values, without explicit control of desired output properties. We study ways of introducing such control for a user-selected region of interest (RoI). First, we overview and analyze the existing work in the areas of image completion (inpainting) and controllable generation. Second, we propose our model based on GANs, which unites approaches from the two mentioned areas, for controllable local content generation. Third, we evaluate the controllability of our model on three accessible datasets – CelebA, Cats, and Cars – and give numerical and visual results of our method.
Migration strategies for object oriented system to component based system – ijfcstjournal
Migration of an object oriented system to a component based system is not an easy task: not only do a lot of technical changes need to be made, but numerous other issues also need to be kept in mind. However, component based software development has been gaining popularity over the past few years and has higher reusability scope. Programs built using the CBSE approach are confirmed to be suitable for new environments. These days it is universal practice to reuse components in projects to achieve better quality and to save time, so moving from object oriented to CBSE seems a wise decision. A number of approaches have been introduced to implement this, and each of them has its own pros and cons. The paper gives a brief review of work by different authors in this area from the year 2000 to 2014.
A Recommender Story: Improving Backend Data Quality While Reducing Costs – Databricks
Information overload is one of the biggest challenges academics face on a daily basis while finding the right knowledge to advance science. With around 7k research articles being published every day, how do you find the right ones?
Elsevier is a global information analytics business that helps institutions and professionals advance healthcare, open science and improve performance. With many data sources and signals being available, data science and big data engineering provide the perfect opportunity to deliver more value to researchers.
Here we will focus on Mendeley, an open (free of charge) academic content platform that helps researchers discover new information via functionalities such as a crowd-sourced collection of academic documents (the Catalogue) and various personalized recommender systems. Mendeley Suggest, the recommender system, helps millions of researchers worldwide find documents and people relevant to their research field that they did not yet know existed. The personalised recommenders are powered by the Mendeley Catalogue, which clusters 2 billion records correctly into canonical records, state-of-the-art algorithms, and big data solutions (e.g. Spark).
In the past few years, we noticed that with our content growth, the quality of the canonical records started drifting due to scalability issues. As a result, we faced clustering accuracy problems which, in turn, also impacted the recommenders. In this talk we will highlight how we rearchitected the fabrication of the Mendeley Catalogue to improve its scalability and accuracy. In addition, we will show how the migration from Hadoop MapReduce to Spark has helped us reduce costs as well as improve maintainability.
This document discusses pointer analysis techniques for object-oriented languages. It provides background on pointer analysis and its importance for program analysis and optimization. It also reviews related work on pointer analysis algorithms for both procedural and object-oriented languages. The document motivates the need for precise pointer analysis to enable optimizations like instruction scheduling and register allocation. It then provides an introduction to Binary Decision Diagrams which are used to represent pointer analysis results.
How to Write an Effective Technical Paper (1).pdf – khalid khan
This document summarizes a webinar on how to write an effective technical paper. The webinar was presented by Saifur Rahman, PhD, who is a professor at Virginia Tech and former president of the IEEE Power & Energy Society. The webinar covered topics like choosing an audience, following proper structure and format, and addressing ethical standards. It provided guidance on elements like the title, abstract, introduction, methodology, results, and references. The webinar emphasized writing clearly for editors and reviewers and suggested steps for collaboratively writing a paper.
On Using Network Science in Mining Developers Collaboration in Software Engineering – IJDKP
Background: Network science is the set of mathematical frameworks, models, and measures that are used to understand a complex system modeled as a network composed of nodes and edges. The nodes of a network represent entities and the edges represent relationships between these entities. Network science has been used in many research works for mining human interaction during different phases of software engineering (SE). Objective: The goal of this study is to identify, review, and analyze the published research works that used network analysis as a tool for understanding human collaboration at different levels of software development. This study and its findings are expected to benefit software engineering practitioners and researchers who are mining software repositories using tools from the network science field. Method: We conducted a systematic literature review, in which we analyzed a number of selected papers from different digital libraries based on inclusion and exclusion criteria. Results: We identified 35 primary studies (PSs) from four digital libraries, then extracted data from each PS according to a predefined data extraction sheet. The results of our data analysis showed that not all of the constructed networks used in the PSs were valid, as the edges of these networks did not reflect a real relationship between the entities of the network. Additionally, the measures used in the PSs were in many cases not suitable for the networks in question. Also, the analysis results reported by the PSs were in most cases not validated using any statistical model. Finally, many of the PSs did not provide lessons or guidelines for software practitioners that could improve software engineering practice. Conclusion: Although employing network analysis in mining developers' collaboration showed satisfactory results in some of the PSs, the application of network analysis needs to be conducted more carefully. That said, the constructed network should be representative and meaningful, the measures used need to suit the context, and validation of the results should be considered. Over and above that, we state some research gaps in which network science can be applied, with some pointers to recent advances that can be used to mine collaboration networks.
This document discusses defect and architectural metrics for assessing the quality of C language software. It presents research goals to determine the distribution of dependency and social network metrics in projects, their relationship to defects, and how metrics and defects evolve over time. Methodologies for generating dependency graphs and calculating metrics are described. Results show correlations between fixing commits and certain metrics, and that metrics change at points of refactoring or major bugs. The research aims to help localize architectural problems through metrics-based analysis.
This document summarizes Andre Freitas' talk on AI beyond deep learning. It discusses representing meaning from text at scale using knowledge graphs and embeddings. It also covers using neuro-symbolic models like graph networks on top of knowledge graphs to enable few-shot learning, explainability, and transportability. The document advocates that AI engineers should focus on representation design and evaluating multi-component NLP systems.
Text Summarization and Conversion of Speech to Text – IRJET Journal
This document discusses text summarization and speech to text conversion using deep learning algorithms. It describes how recurrent neural networks can be used for text summarization by identifying key information and semantic meaning from text. Speech recognition uses similar deep learning methods to convert spoken audio to text. The document also provides an overview of the text summarization process, including segmentation, normalization, feature extraction, and modeling steps. It concludes that these models can generate summarized text from extensive documents and meetings.
Development of Computer Aided Learning Software for Use in Electric Circuit Analysis – drboon
Presently, instructors are required to teach more students with the same resources, thereby reducing the amount of time instructors have with their students. Because of this, examples may be omitted to be able to make it through all of the required material. This can be problematic with electric circuit analysis courses and other courses used as prerequisites. A lack of understanding in these classes will likely continue in future classes. While software is often used in these classes, often it is analysis software not meant to teach concepts. Teaching software does exist, but may have only a preset number of problems or only provide the solution. Others provide a ‘limitless’ number of problems by changing component values, but each ends up being the same basic problem. This paper introduces new learning software that addresses these shortcomings. The software provides a practically limitless number of problems by varying component values and circuit structure. Moreover, it provides both an answer and an explanation. Finally, it is designed so that students who need more help can get it, while those who do not can move on.
IRJET- Natural Language Query Processing – IRJET Journal
The document discusses the development of a natural language query processing system that allows users to retrieve data from a database using simple English statements rather than SQL queries. It proposes a system that takes an English query as input, analyzes it to extract keywords, uses those keywords to generate an equivalent SQL query, executes the SQL query on the database, and returns the results to the user. The system is meant to make accessing database information easier for non-technical users by allowing them to use natural language instead of SQL.
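A toy sketch of the keyword-to-SQL idea; the table, columns, and mapping rules are invented for illustration and are far simpler than a real system.

```python
# Map recognised keywords in an English query to a SQL statement.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE students(name TEXT, grade INTEGER)")
con.executemany("INSERT INTO students VALUES (?, ?)",
                [("Asha", 91), ("Ravi", 78)])

def english_to_sql(query):
    # Very naive keyword spotting; real systems parse the query properly.
    tokens = query.lower().split()
    table = "students" if "students" in tokens else None
    column = "grade" if {"grade", "grades"} & set(tokens) else "*"
    return f"SELECT name, {column} FROM {table}" if table else None

sql = english_to_sql("show the grades of all students")
print(sql, "->", con.execute(sql).fetchall())
```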
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... – Spark Summit
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself by – Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts covering a few hundred years. – We create a fingerprint for each content item by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine designed. – Traditional author-disambiguation or record deduplication algorithms are batch-processing with small to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles, and so we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise similarity scores into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
Claremont Report on Database Research: Research Directions (Le Gruenwald) – infoblog
This is a set of slides from the Claremont Report on Database Research, see http://db.cs.berkeley.edu/claremont/ for more details. These particular slides are from a "Research Directions" talk by "Le Gruenwald." (Uploaded for discussion at the Stanford InfoBlog, http://infoblog.stanford.edu/.)
The document summarizes a research paper that proposed a link prediction model for citation networks. It applied support vector machines (SVMs) as the classifier and used 11 features optimized for citation networks across 5 academic fields. With these features, the model predicted links more accurately than the classifier alone. However, the effective features varied by academic field, suggesting that different models should be applied to different research areas.
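An illustrative sketch of SVM-based link prediction with scikit-learn; the three pairwise features and their values are invented, and the paper's 11 features are not reproduced.

```python
# SVM over hand-built features of candidate citation edges.
import numpy as np
from sklearn.svm import SVC

# Columns (invented): common neighbours, topical similarity, year gap.
X = np.array([[5, 0.9, 1], [0, 0.1, 9], [4, 0.8, 2], [1, 0.2, 7]])
y = np.array([1, 0, 1, 0])  # 1 = a citation link later appeared

clf = SVC(kernel="rbf").fit(X, y)
print(clf.decision_function([[3, 0.7, 3]]))  # >0 leans towards a future link
print(clf.predict([[3, 0.7, 3]]))
```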
Similar to Determining the Credibility of Science Communication
Beyond Fact Checking — Modelling Information Change in Scientific Communication – Isabelle Augenstein
The document discusses modelling information change in scientific communication. It begins by noting how science is often communicated through journalists to the public, and how the message can change and become exaggerated or misleading along the way. It then discusses developing models to detect exaggeration by predicting the strength of causal claims, such as distinguishing between correlational and causal language. Pattern exploiting training is explored as a way to leverage large language models for this task in a semi-supervised manner. Finally, it proposes generally modelling information change by comparing original research to how it is communicated elsewhere, such as in news articles and tweets, using semantic matching techniques. Experiments are discussed on newly created datasets to benchmark performance of models on this task.
The document discusses automatically detecting scientific misinformation and exaggeration. It introduces work on cite-worthiness detection to improve scientific document understanding, and on detecting exaggeration in health science press releases. It describes generating scientific claims from citations for zero-shot scientific fact checking. The talk covers claim detection and generation, cite-worthiness detection, scientific claim generation, and exaggeration detection.
The past decade has seen a substantial rise in the amount of mis- and disinformation online, from targeted disinformation campaigns to influence politics, to the unintentional spreading of misinformation about public health. This development has spurred research in the area of automatic fact checking, a knowledge-intensive and complex reasoning task. Most existing fact checking models predict a claim’s veracity with black-box models, which often lack explanations of the reasons behind their predictions and contain hidden vulnerabilities. The lack of transparency in fact checking systems and ML models in general has been exacerbated by increased model size and by “the right…to obtain an explanation of the decision reached” enshrined in European law. This talk presents some first solutions to generating explanations for fact checking models. It then examines how to assess the generated explanations using diagnostic properties, and how further optimising for these diagnostic properties can improve the quality of the generated explanations. Finally, the talk examines how to systematically reveal vulnerabilities of black-box fact checking models.
Towards Explainable Fact Checking (DIKU Business Club presentation)Isabelle Augenstein
Outline:
- Fact checking – what is it and why do we need it?
- False information online
- Content-based automatic fact checking
- Explainability – what is it and why do we need it?
- Making the right predictions for the right reasons
- Model training pipeline
- Explainable fact checking – some first solutions
- Rationale selection
- Generating free-text explanations
- Wrap-up
Tutorial on 'Explainability for NLP' given at the first ALPS (Advanced Language Processing) winter school: http://lig-alps.imag.fr/index.php/schedule/
The talk introduces the concepts of 'model understanding' as well as 'decision understanding' and provides examples of approaches from the areas of fact checking and text classification.
Exercises to go with the tutorial are available here: https://github.com/copenlu/ALPS_2021
Automatic fact checking is one of the more involved NLP tasks currently researched: not only does it require sentence understanding, but also an understanding of how claims relate to evidence documents and world knowledge. Moreover, there is still no common understanding in the automatic fact checking community of how the subtasks of fact checking — claim check-worthiness detection, evidence retrieval, veracity prediction — should be framed. This is partly owing to the complexity of the task, despite efforts to formalise the task of fact checking through the development of benchmark datasets.
The first part of the talk will be on automatically generating textual explanations for fact checking, thereby exposing some of the reasoning processes these models follow. The second part of the talk will be on re-examining how claim check-worthiness is defined, and how check-worthy claims can be detected; followed by how to automatically generate claims which are hard to fact-check automatically.
Talk on 'Tracking False Information Online' at W-NUT workshop at EMNLP 2019.
=========
Digital media enables fast sharing of information and discussions among users. While this comes with many benefits to today’s society, such as broadening information access, the manner in which information is disseminated also has obvious downsides. Since fast access to information is expected by many users and news outlets are often under financial pressure, speedy access often comes at the expense of accuracy, which leads to misinformation. Moreover, digital media can be misused by campaigns to intentionally spread false information, i.e. disinformation, about events, individuals or governments. In this talk, I will present different ways false information is spread online, including misinformation and disinformation. I will then report findings from our recent and ongoing work on automatic fact checking, stance detection and framing attitudes.
What can typological knowledge bases and language representations tell us abo...Isabelle Augenstein
One of the core challenges in typology is to record properties of languages in a structured way. As a result of manual efforts, typological knowledge bases have emerged, which contain information about languages’ phonological, morphological and syntactic properties, as well as about language families. Ideally, such typological knowledge bases would provide useful information for multilingual NLP models to learn how to selectively share parameters.
A related area of research suggests a different way of encoding properties of languages, namely to learn language representation vectors directly from text documents.
In this talk, I will analyse and contrast these two ways of encoding linguistic properties, as well as present research on how the two can benefit one another.
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate ...Isabelle Augenstein
Paper presented at NAACL 2018. Link: https://arxiv.org/abs/1802.09913
Abstract:
============
We combine multi-task learning and semi-supervised learning by inducing a joint embedding space between disparate label spaces and learning transfer functions between label embeddings, enabling us to jointly leverage unlabelled data and auxiliary, annotated datasets. We evaluate our approach on a variety of sequence classification tasks with disparate label spaces. We outperform strong single and multi-task baselines and achieve a new state-of-the-art for topic-based sentiment analysis.
Learning with limited labelled data in NLP: multi-task learning and beyondIsabelle Augenstein
When labelled training data for certain NLP tasks or languages is not readily available, different approaches exist to leverage other resources for the training of machine learning models. Those are commonly either instances from a related task or unlabelled data.
An approach that has been found to work particularly well when only limited training data is available is multi-task learning.
There, a model learns from examples of multiple related tasks at the same time by sharing hidden layers between tasks, and can therefore benefit from a larger overall number of training instances and improve its generalisation performance. In the related paradigm of semi-supervised learning, unlabelled data as well as labelled data for related tasks can easily be utilised by transferring labels from labelled instances to unlabelled ones, essentially extending the training dataset.
In this talk, I will present my recent and ongoing work in the space of learning with limited labelled data in NLP, including our NAACL 2018 papers 'Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces' [1] and 'From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings' [2].
[1] https://t.co/A5jHhFWrdw
[2] https://arxiv.org/abs/1802.09375
==========
Bio from my website http://isabelleaugenstein.github.io/index.html:
I have been a tenure-track assistant professor in the Department of Computer Science at the University of Copenhagen since July 2017. I am affiliated with the CoAStAL NLP group and work in the general areas of Statistical Natural Language Processing and Machine Learning. My main research interests are weakly supervised and low-resource learning with applications including information extraction, machine reading and fact checking.
Before starting a faculty position, I was a postdoctoral research associate in Sebastian Riedel's UCL Machine Reading group, mainly investigating machine reading from scientific articles. Prior to that, I was a Research Associate in the Sheffield NLP group, a PhD Student in the University of Sheffield Computer Science department, a Research Assistant at AIFB, Karlsruhe Institute of Technology and a Computational Linguistics undergraduate student at the Department of Computational Linguistics, Heidelberg University.
The spread of mis- and disinformation is growing and is having a big impact on interpersonal communication, politics and even science.
Traditional methods, e.g. manual fact-checking by reporters, cannot keep up with the growth of information. On the other hand, there has been much progress in natural language processing recently, partly due to the resurgence of neural methods.
How can natural language processing methods fill this gap and help to automatically check facts?
This talk will explore different ways to frame fact checking and detail our ongoing work on learning to encode documents for automated fact checking, as well as describe future challenges.
1st Workshop for Women and Underrepresented Minorities (WiNLP) at ACL 2017 - ...Isabelle Augenstein
The document summarizes the history and goals of the WiNLP workshop, which aims to promote and support women and underrepresented groups in natural language processing. It discusses the growth of WiNLP from its inception in 2016 to the 2017 workshop with over 130 participants. It outlines WiNLP's mission to increase awareness of work by underrepresented groups and build community. It also notes challenges such as underrepresentation, bias, and lack of resources that WiNLP addresses through mentoring, funding, and community building.
Presentation of work that will be published at EMNLP 2016.
Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bošnjak, Sebastian Riedel. emoji2vec: Learning Emoji Representations from their Description. SocialNLP at EMNLP 2016. https://arxiv.org/abs/1609.08359
Georgios Spithourakis, Isabelle Augenstein, Sebastian Riedel. Numerically Grounded Language Models for Semantic Error Correction. EMNLP 2016. https://arxiv.org/abs/1608.04147
Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, Kalina Bontcheva. Stance Detection with Bidirectional Conditional Encoding. EMNLP 2016. https://arxiv.org/abs/1606.05464
USFD at SemEval-2016 - Stance Detection on Twitter with AutoencodersIsabelle Augenstein
This paper describes the University of Sheffield's submission to the SemEval 2016 Twitter Stance Detection weakly supervised task (SemEval 2016 Task 6, Subtask B). In stance detection, the goal is to classify the stance of a tweet towards a target as "favor", "against", or "none". In Subtask B, the targets in the test data are different from the targets in the training data, thus rendering the task more challenging but also more realistic.
To address the lack of target-specific training data, we use a large set of unlabelled tweets containing all targets and train a bag-of-words autoencoder to learn how to produce feature representations of tweets. These feature representations are then used to train a logistic regression classifier on labelled tweets, with additional features such as an indicator of whether the target is contained in the tweet. Our submitted run on the test data achieved an F1 of 0.3270.
Paper: http://isabelleaugenstein.github.io/papers/SemEval2016-Stance.pdf
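A hedged sketch of the pipeline described above (illustrative data and hyperparameters, not the exact USFD system): a small bag-of-words autoencoder is trained on unlabelled tweets, and its hidden representation, concatenated with a target-in-tweet indicator, feeds a logistic regression stance classifier:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

unlabelled = ["climate change is real", "i support the new policy", "no opinion on this"]
labelled = [
    ("climate change is a hoax", "climate change", "against"),
    ("i fully support the new policy", "the new policy", "favor"),
]

vec = CountVectorizer(binary=True)
X_unlab = torch.tensor(vec.fit_transform(unlabelled).toarray(), dtype=torch.float32)

# Tiny bag-of-words autoencoder: reconstruct the input through a bottleneck.
d = X_unlab.shape[1]
encoder, decoder = nn.Linear(d, 16), nn.Linear(16, d)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(decoder(torch.relu(encoder(X_unlab))), X_unlab)
    loss.backward()
    opt.step()

def features(tweet, target):
    """Autoencoder representation plus a target-in-tweet indicator feature."""
    x = torch.tensor(vec.transform([tweet]).toarray(), dtype=torch.float32)
    h = torch.relu(encoder(x)).detach().numpy()[0]
    return np.append(h, float(target in tweet))

X = np.array([features(t, tgt) for t, tgt, _ in labelled])
y = [lab for _, _, lab in labelled]
clf = LogisticRegression().fit(X, y)  # stance: favor / against / none
```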
Imitation learning is used to address the problem of distant supervision for relation extraction. It decomposes the task into named entity classification (NEC) and relation extraction (RE), allowing the models to be trained separately. Through an iterative process, imitation learning is able to learn the dependencies between NEC and RE even when only labels for RE are provided. This overcomes limitations of prior approaches that rely on distantly labeled data. Evaluation shows the approach improves over baselines by leveraging multi-stage modeling to compensate for mistakes at the NEC stage.
Extracting Relations between Non-Standard Entities using Distant Supervision ...Isabelle Augenstein
Poster for our EMNLP paper on extracting non-standard relations from the Web with distant supervision and imitation learning. Read the full paper here: https://aclweb.org/anthology/D/D15/D15-1086.pdf
Slides for my tutorial at the ESWC Summer School 2015, giving an introduction to information extraction with Linked Data and an introduction to one of the applications of information extraction, opinion mining.
Relation Extraction from the Web using Distant SupervisionIsabelle Augenstein
This paper proposes using distant supervision to extract relations from web text to populate knowledge bases without requiring manual effort. It does this by using an existing knowledge base to automatically label sentences with entity relations, training a classifier on this distant supervision data. The paper describes using statistical methods to select better training data and discard noisy examples, and shows this improves precision. It also introduces methods for integrating information across sentences which improves both precision and recall of extracted relations.
3. Supporting the Life Cycle of Research
[Diagram: the research life cycle (Information Discovery, Conducting Experiments, Paper Writing, Peer Review, Research Impact Tracking) with supporting NLP tasks: Information Extraction, Summarisation, Citation Prediction, Writing Assistance, Reviewing Support, Reviewer Matching, Review Score Prediction, Citation Analysis, Citation Trend Analysis]
4. Scholarly Document Processing
• Goal: to automatically process scientific text to support scholars
• Example NLP tasks
• Extract information about scientific concepts, e.g. drugs and proteins
• Recommend relevant papers to cite
• Challenges
• Supervised learning is hard: annotation is expensive, requiring domain experts
• Language used is diverse across fields
• Different modalities
• Meta-data also important
7. Credibility and Veracity of Science Communication
• Shortcomings of prior work
• Assumes scientific writing is credible
• Assumes claims made are supported by underlying evidence
• Example issues
• When writing a paper
• Making claims not backed up by literature
• Missing important citations
• Presenting conclusions not supported by data
• Popular science communication
• Distortion of findings
• Exaggerations
• Outright misrepresentations
8. Challenges Addressed In This Talk
• Cite-worthiness detection
• Detecting if a sentence should include a citation to prior work
• Useful for assistive writing of scientific papers
• Similar to claim detection in fact checking
• Exaggeration detection
• Detecting if a news article exaggerates claims made in a scientific paper
• Useful for assistive writing & quality check of press releases
• Related to veracity prediction, but a more nuanced task
9. Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Cite-Worthiness Detection
• The CiteWorth dataset
• Methods for cite-worthiness detection
• Part 2: Exaggeration Detection
• Task framing
• Semi-supervised learning for exaggeration detection
• Conclusion
• Future research challenges
11. Scholarly Document Processing
• Challenges
• Supervised learning is hard: annotation is expensive, requiring domain experts
• The text is diverse across fields
• How can we improve tools for scholarly document processing across fields?
• What training data is readily available?
13. Citances in Machine Learning
We use the model from the original BERT paper (Devlin et al. 2019).
Cite-worthiness: Is this a citance? Yes
Recommendation: What paper should be cited? Devlin et al. (2019)
Influence: Was this an influential paper? Yes
Intent: What is the purpose of the citation? Method
14. Cite-Worthiness Uses
We use the model from the original BERT paper (Devlin et al. 2019).
As an auxiliary task in a multi-task setup
We use the model from the original BERT paper (Devlin et al. 2019).
CITE-WORTHY METHOD
15. Cite-Worthiness Uses
We use the model from the original BERT paper (Devlin et al. 2019).
As a first step in citation recommendation
We use the model from the original BERT paper (Devlin et al. 2019).
CITE-WORTHY
16. Cite-Worthiness Uses
We use the model from the original BERT paper (Devlin et al. 2019).
For assistive document editing
We use the model from the original BERT paper (Devlin et al. 2019).
CITE-WORTHY
17. Cite-Worthiness Datasets
• Tend to be small and limited to only a few domains (e.g. Computer Science)
• No attention paid to how clean the data is (e.g. ungrammatical phrases)
Example: "We use the model from Devlin et al. (2019) as a baseline."
18. CiteWorth: Dataset Curation
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
• Source data: S2ORC1 – millions of extracted scientific documents from Semantic Scholar
• We limit citances as follows
• Parenthetical author/year and bracketed numerical citations only
• Citations must be at the end of a sentence
Examples: "We use the model from the original BERT paper (Devlin et al. 2019)." / "We use the model from the original BERT paper [1]."
1. https://github.com/allenai/s2orc
19. CiteWorth: Cleaning the Data
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
Example: "We use the model from the original BERT paper (Devlin et al. 2019). This model uses self-attention and masked language modeling."
1. Extract whole paragraphs – data is curated at the paragraph level
2. Check whether all gold citation spans are parenthetical author/year or bracketed numerical
3. Check if all citation spans have been extracted for each sentence
4. Check if all citation spans come at the end of a sentence
5. Remove citation spans using gold spans
6. Check if any citation markers are left over (e.g. hanging prepositions/punctuation)
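The filtering in steps 4-6 can be approximated with simple regular expressions. Below is a hedged sketch, not the actual CiteWorth code; the patterns and the leftover-marker check are assumptions:

```python
import re

# Matches parenthetical author/year citations like "(Devlin et al. 2019)" and
# bracketed numerical citations like "[1]" or "[1, 2]".
CITATION = re.compile(r"\s*(\(\s*[A-Z][^()]*\d{4}[^()]*\)|\[\d+(,\s*\d+)*\])")

def clean_sentence(sentence):
    """Return the citation-free sentence, or None if it fails the checks."""
    match = CITATION.search(sentence)
    if match is None:
        return sentence  # no citation: a candidate non-cite-worthy sentence
    # Step 4: the citation span must come at the end of the sentence.
    if not sentence.rstrip(" .").endswith(match.group().strip(" .")):
        return None
    # Step 5: remove the citation span(s).
    cleaned = CITATION.sub("", sentence).strip()
    # Step 6: discard sentences with leftover markers, e.g. hanging prepositions.
    if re.search(r"\b(in|by|of|from|see)\s*[.?!]?$", cleaned):
        return None
    return cleaned

print(clean_sentence("We use the model from the original BERT paper (Devlin et al. 2019)."))
# -> "We use the model from the original BERT paper."
```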
20. CiteWorth Final Dataset
RQ1: How can a dataset for cite-worthiness detection be automatically curated with low noise?
• 1,181,793 sentences
• 10 different fields, 20,000+ paragraphs per field
• Much cleaner than a naive baseline which only removes citation text based on gold spans

Method            Sentences Clean (%)   Citation Markers Removed (%)
Naive Baseline    92.07                 92.78
CiteWorth (Ours)  98.90                 98.10
21. Predicting on Individual Sentences
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
[Diagram: candidate models: logistic regression, a convolutional recurrent network1, and pretrained transformer language models (embedding, multi-head attention, feed-forward, add & norm layers)2]
1. Michael Färber, Alexander Thiemann, and Adam Jatowt. 2018b. To Cite, or Not to Cite? Detecting Citation Contexts in Text. In European Conference on Information Retrieval, pages 598–603. Springer.
2. https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270
22. Predicting on Individual Sentences
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
Can context improve performance?

Method               P      R      F1
Logistic Regression  46.65  64.88  54.28
CRNN                 50.87  62.21  55.97
Transformer          47.92  71.59  57.39
BERT                 55.04  69.02  61.23
SciBERT              57.03  68.08  62.06

* Pieter-Jan Kindermans, Kristof Schütt, Klaus-Robert Müller, and Sven Dähne. 2016. Investigating the Influence of Noise and Distractors on the Interpretation of Neural Networks. arXiv preprint arXiv:1611.07270.
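To make the comparison concrete, here is a minimal sketch of the strongest single-sentence setup above (SciBERT with a binary classification head), using the public allenai/scibert_scivocab_uncased checkpoint; the toy sentences and labels stand in for CiteWorth examples:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=2)  # 0 = not cite-worthy, 1 = cite-worthy

sentences = ["We use the model from the original BERT paper.",
             "The remainder of this section is organised as follows."]
labels = torch.tensor([1, 0])  # toy labels standing in for CiteWorth

batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
out = model(**batch, labels=labels)  # cross-entropy loss + logits
out.loss.backward()                  # one fine-tuning step; optimiser omitted
preds = out.logits.argmax(dim=-1)    # per-sentence cite-worthiness prediction
```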
23. Predicting Multiple Sentences at Once
RQ2: What methods are most effective for automatically detecting cite-worthy sentences?
Are there variations across fields?
[Diagram: Longformer* encodes a full paragraph at once ([CLS] s1 [SEP] s2 [SEP] …), pooling each sentence's tokens and classifying them jointly]

Method           P      R      F1
SciBERT          57.03  68.08  62.06
Longformer-Solo  57.21  68.00  62.14
Longformer-Ctx   59.92  77.15  67.45  (Δ 5 pts)

* Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The Long-Document Transformer. CoRR, abs/2004.05150.
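A sketch of the Longformer-Ctx idea under stated assumptions: all sentences of a paragraph are encoded in one pass, separated by the tokenizer's separator token, and each sentence's token vectors are mean-pooled and classified. The pooling and classifier details are illustrative, not necessarily the exact paper setup:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
encoder = AutoModel.from_pretrained("allenai/longformer-base-4096")
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # cite-worthy or not

paragraph = ["We build on transformer language models.",
             "We use the model from the original BERT paper.",
             "Results are reported in Section 5."]
text = tok.sep_token.join(paragraph)            # one sequence, sentences split by </s>
batch = tok(text, return_tensors="pt")
hidden = encoder(**batch).last_hidden_state[0]  # (seq_len, hidden_size)

# Recover sentence boundaries from separator positions, pool, and classify.
sep_positions = (batch["input_ids"][0] == tok.sep_token_id).nonzero().flatten().tolist()
start, logits = 1, []                           # position 0 is the <s> token
for end in sep_positions:
    logits.append(classifier(hidden[start:end].mean(dim=0)))  # mean-pool the sentence
    start = end + 1
logits = torch.stack(logits)                    # one prediction per sentence, with context
```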
24. Transfer Learning
RQ4: Can large scale cite-worthiness data be used to perform transfer learning to downstream scientific text tasks?
• Pretrain a model and fine-tune on 10 tasks (NER, relation extraction, text classification)
• Base: Original SciBERT model fine-tuned on downstream tasks
• LM: SciBERT with MLM fine-tuning on CiteWorth
• Cite: SciBERT fine-tuned on cite-worthiness detection
• LMCite: SciBERT with MLM fine-tuning on CiteWorth + fine-tuned on cite-worthiness
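The LMCite recipe can be sketched as two stages, shown below with illustrative data and a single gradient step per stage; the local checkpoint directory name is hypothetical:

```python
from transformers import (AutoModelForMaskedLM, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorForLanguageModeling)

tok = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# Stage 1: masked-language-model fine-tuning on CiteWorth text (one toy step).
mlm = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")
collator = DataCollatorForLanguageModeling(tok, mlm_probability=0.15)
enc = tok("We use the model from the original BERT paper.", truncation=True)
batch = collator([{"input_ids": enc["input_ids"]}])  # randomly masks tokens
mlm(**batch).loss.backward()                         # an optimiser step would follow
mlm.save_pretrained("scibert-citeworth-mlm")         # hypothetical local path
tok.save_pretrained("scibert-citeworth-mlm")

# Stage 2: fine-tune the adapted encoder on cite-worthiness labels
# (as in the earlier single-sentence sketch), giving the "LMCite" configuration.
clf = AutoModelForSequenceClassification.from_pretrained(
    "scibert-citeworth-mlm", num_labels=2)
```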
26. Conclusions
• We introduce CiteWorth – a large, rigorously cleaned dataset for citation-related tasks
• We show that paragraph-level context is crucial to perform cite-worthiness detection
• We show that the data is diverse, with a significant domain effect
• We show that cite-worthiness is a highly transferable task for scientific text
27. Open Questions
• How to improve domain adaptation for scientific text?
• What other useful features are there?
• Author network
• Document-level context
• Other types of structure (case study: discourse structure)
• Other tasks using this data, e.g. citation recommendation
28. Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Cite-Worthiness Detection
• The CiteWorth dataset
• Methods for cite-worthiness detection
• Part 2: Exaggeration Detection
• Task framing
• Semi-supervised learning for exaggeration detection
• Conclusion
• Future research challenges
31. Problem
https://www.sciencedaily.com/releases/2021/05/210525101658.htm
Yijun Bao, Somayyeh Soltanian-Zadeh, Sina Farsiu, Yiyang Gong. Segmentation of neurons from fluorescence calcium recordings beyond real time. Nature Machine Intelligence, 2021; DOI: 10.1038/s42256-021-00342-x
The abstract makes a conditionally causal claim (“potentially enabling”), while the press release makes a direct causal claim.
32. Our Contributions
• Formalisation of the problem of scientific exaggeration detection
• Curation of a benchmark dataset for scientific exaggeration detection
• Semi-supervised method based on Pattern Exploiting Training (PET) to address the task
33. Prior Work on Understanding Exaggeration in Science
• Manual attempts
• Sumner et al. 2014 and Bratton et al. 2019: InSciOut
• Manually label 823 pairs of press releases and abstracts
• Labels: causal claim strength of conclusions, advice given, independent and dependent variables, etc.
• Find that about 33% of press releases contain exaggerated conclusions
• Major problem: press releases are the “dominant link between academia and the media”
• Automatic attempts
• Li et al. 2017, Yu et al. 2019, Yu et al. 2020
• Predict causal claim strength of conclusion sentences in abstract and press release
• No clean paired data for evaluation
34. Our Work on Exaggeration Detection in Science
• The focus of this work is predicting when a press release exaggerates a scientific paper
• We focus on predicting this using the primary finding of the paper as written in the abstract and the press release
• We build on previous work which focuses on causal claim strength prediction of these primary findings
36. Formal Problem Definition
Dataset D = {(t_i, s_i, y_i) | i ∈ [0 … N)}
• Source documents s_i
• Target documents t_i written about s_i
• Labels y_i ∈ {0: Downplays, 1: Same, 2: Exaggerates}, indicating whether t_i exaggerates, downplays, or faithfully represents s_i
Learning goal: predict y given s and t
37. Task Formulations
• T1
• Entailment-like task to predict the exaggeration label
• Paired (press release, abstract) data
• Label space L_T1 = {0: Downplays, 1: Same, 2: Exaggerates}
• T2
• Text classification task to predict causal claim strength
• Unpaired press releases and abstracts
• Final prediction compares the strength of a paired press release and abstract
• Label space L_T2 = {0: No Relation, 1: Correlational, 2: Conditional Causal, 3: Direct Causal}

Language cues per claim-strength label (Li et al. 2017):
Label  Type                Language Cue
0      No Relation
1      Correlational       association, associated with, predictor, at high risk of
2      Conditional causal  increase, decrease, lead to, effect on, contribute to, result in (cues indicating doubt: may, might, appear to, probably)
3      Direct causal       increase, decrease, lead to, effective on, contribute to, reduce, can
38. Evaluation Dataset Creation
• Start with the 823 labeled pairs from Sumner et al. 2014 and Bratton et al. 2019 (InSciOut)
• Collect original abstract text from Semantic Scholar
• Match original conclusion sentences to paraphrased annotations via ROUGE score
• Manually inspect and discard missing or incorrect abstracts
• Final label: compare annotated claim strength (s_p for the press release, s_a for the abstract):
  Downplays if s_p < s_a; Same if s_p = s_a; Exaggerates if s_p > s_a
• Total data: 663 pairs (100 training, 553 test)
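The final labelling rule reduces to a three-way comparison; a small sketch (function and variable names are illustrative):

```python
STRENGTH = {0: "No Relation", 1: "Correlational", 2: "Conditional Causal", 3: "Direct Causal"}

def exaggeration_label(press_strength: int, abstract_strength: int) -> str:
    """Compare paired claim strengths (s_p vs s_a) to obtain the T1 label."""
    if press_strength > abstract_strength:
        return "Exaggerates"
    if press_strength < abstract_strength:
        return "Downplays"
    return "Same"

# The slide 31 example: a conditionally causal abstract ("potentially enabling")
# reported with a direct causal claim in the press release.
print(exaggeration_label(press_strength=3, abstract_strength=2))  # Exaggerates
```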
40. PET (Schick et al. 2020)
[Diagram: a traditional classifier maps “Eating chocolate causes happiness” directly to a distribution over the labels 0–3; PET instead feeds “Eating chocolate causes happiness. The claim strength is [MASK]” to a large pretrained language model and reads off probabilities for label words such as medium, estimated, cautious, distorted]
• Pattern: transform the input to a cloze-style question
• Verbalizer: predict tokens from the language model which reflect the data’s labels
• Multiple pattern-verbalizer pairs produce soft labels on unlabelled data, which train a final classifier with a KL-divergence loss
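A minimal sketch of the PET idea with HuggingFace transformers, assuming roberta-base as the masked language model; the pattern text and verbalizer words here are illustrative assumptions, not the paper's exact choices:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Verbalizer: one label word per claim-strength class (first sub-token is used).
VERBALIZER = {label: tok(" " + word, add_special_tokens=False)["input_ids"][0]
              for label, word in enumerate(["weak", "medium", "cautious", "strong"])}

def pet_scores(sentence):
    """Pattern: turn the input into a cloze question; score labels via the MLM."""
    text = f"{sentence} The claim strength is {tok.mask_token}."
    batch = tok(text, return_tensors="pt")
    mask_pos = (batch["input_ids"][0] == tok.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = mlm(**batch).logits[0, mask_pos]
    return torch.stack([logits[VERBALIZER[l]] for l in range(4)]).softmax(-1)

print(pet_scores("Eating chocolate causes happiness."))
# In full PET, such soft labels supervise a classifier on unlabelled data
# (the slide's KL-divergence distillation step).
```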
43. T1 (Exaggeration Detection) with MT-PET

Method      P      R      F1
Supervised  28.06  33.10  29.05
PET         41.90  39.87  39.12
MT-PET      47.80  47.99  47.35

• Substantial improvements when using PET (10 points)
• Further improvements with MT-PET (8 points)
• Demonstrates transfer of knowledge from claim strength prediction to exaggeration prediction
44. T2 (Claim Strength Prediction) with MT-PET

200 samples from T2, 100 samples from T1:
Method      P      R      F1
Supervised  49.28  51.07  49.03
PET         55.76  58.58  56.57
MT-PET      56.68  60.13  57.44

4500 samples from T2, 100 samples from T1:
Method      P      R      F1
Supervised  58.20  59.99  58.66
PET         58.53  61.84  60.45
MT-PET      60.09  61.11  —

• MT-PET outperforms PET in both scenarios
45. T2 (Claim Strength Prediction) with MT-PET
(Same results as the previous slide, for 200 and 4,500 T2 samples respectively.)
• MT-PET with 200 samples approaches supervised performance with 4,500 samples
46. Error Analysis
• All models:
• disproportionately get pairs involving direct causal claims incorrect
• do best for correlational claims from abstracts, and claims from press releases which are correlational or stronger
• MT-PET:
• helps the most for the most difficult category – causal claims
47. Summary
• We formalize the problem of scientific exaggeration detection, providing two task formulations for the problem
• We curate a set of benchmark data to evaluate automatic methods for performing the task
• We propose MT-PET, a few-shot learning method based on PET, which we demonstrate outperforms strong baselines
48. Overview of Today’s Talk
• Introduction
• The Life Cycle of Scientific Research
• Part 1: Cite-Worthiness Detection
• The CiteWorth dataset
• Methods for cite-worthiness detection
• Part 2: Exaggeration Detection
• Task framing
• Semi-supervised learning for exaggeration detection
• Conclusion
• Future research challenges
50. Supporting the Life Cycle of Research
[Diagram, as on slide 3: the research life cycle (Information Discovery, Conducting Experiments, Paper Writing, Peer Review, Research Impact Tracking) with supporting NLP tasks: Information Extraction, Summarisation, Citation Prediction, Writing Assistance, Reviewing Support, Reviewer Matching, Review Score Prediction, Citation Analysis, Citation Trend Analysis]
51. Supporting the Life Cycle of Research
[Diagram: the same life-cycle figure, now with Credibility Detection added as a NEW supporting task]
52. Overall Take-Aways
• Why scholarly document processing?
• Supporting the life cycle of research, from information discovery to research impact tracking
• Why credibility detection for scholarly communication?
• Detect claims which should be backed up by evidence (cite-worthiness detection)
• Detect inconsistencies between primary and secondary sources of information (exaggeration detection)
53. Overall Take-Aways
• Overarching challenges
• Difficult NLP tasks (require understanding of pragmatics)
• Domain effects and the importance of context pose further challenges
• Not well-studied yet
• Scarcity of available benchmarks
• Many opportunities for future work
• Explore more diverse settings
• Gather more datasets
• Methods for domain adaptation & few-shot learning
• Tools for journalists & authors