This document is Justin Sybrandt's thesis proposal at Clemson University, outlining his past and proposed work on exploiting latent features in text and graphs. It summarizes his peer-reviewed work on using embeddings to generate biomedical hypotheses from text and to validate those hypotheses through ranking, discusses pending work on heterogeneous bipartite graph embeddings and partitioned hypergraphs, and outlines proposed future research directions involving graph embeddings.
Automatic Classification of Springer Nature Proceedings with Smart Topic Miner
by Francesco Osborne
The process of classifying scholarly outputs is crucial to ensure timely access to knowledge. However, this process is typically carried out manually by expert editors, leading to high costs and slow throughput. In this paper we present Smart Topic Miner (STM), a novel solution which uses semantic web technologies to classify scholarly publications on the basis of a very large automatically generated ontology of research areas. STM was developed to support the Springer Nature Computer Science editorial team in classifying proceedings in the LNCS family. It analyses in real time a set of publications provided by an editor and produces a structured set of topics and a number of Springer Nature classification tags, which best characterise the given input. In this paper we present the architecture of the system and report on an evaluation study conducted with a team of Springer Nature editors. The results of the evaluation, which showed that STM classifies publications with a high degree of accuracy, are very encouraging and as a result we are currently discussing the required next steps to ensure large-scale deployment within the company.
EKAW 2016 - TechMiner: Extracting Technologies from Academic Publications
by Francesco Osborne
In recent years we have seen the emergence of a variety of scholarly datasets. Typically these capture ‘standard’ scholarly entities and their connections, such as authors, affiliations, venues, publications, citations, and others. However, as the repositories grow and the technology improves, researchers are adding new entities to these repositories to develop a richer model of the scholarly domain. In this paper, we introduce TechMiner, a new approach, which combines NLP, machine learning and semantic technologies, for mining technologies from research publications and generating an OWL ontology describing their relationships with other research entities. The resulting knowledge base can support a number of tasks, such as: richer semantic search, which can exploit the technology dimension to support better retrieval of publications; richer expert search; monitoring the emergence and impact of new technologies, both within and across scientific fields; studying the scholarly dynamics associated with the emergence of new technologies; and others.
TechMiner was evaluated on a manually annotated gold standard and the results indicate that it significantly outperforms alternative NLP approaches and that its semantic features improve performance significantly with respect to both recall and precision.
The CSO Classifier: Ontology-Driven Detection of Research Topics in Scholarly...
by Angelo Salatino
Classifying research papers according to their research topics is an important task to improve their retrievability, assist the creation of smart analytics, and support a variety of approaches for analysing and making sense of the research environment. In this paper, we present the CSO Classifier, a new unsupervised approach for automatically classifying research papers according to the Computer Science Ontology (CSO), a comprehensive ontology of research areas in the field of Computer Science. The CSO Classifier takes as input the metadata associated with a research paper (title, abstract, keywords) and returns a selection of research concepts drawn from the ontology. The approach was evaluated on a gold standard of manually annotated articles, yielding a significant improvement over alternative methods.
Clustering Citation Distributions for Semantic Categorization and Citation Prediction
by F. Osborne, S. Peroni, E. Motta
In this paper we present i) an approach for clustering authors according to their citation distributions and ii) an ontology, the Bibliometric Data Ontology, for supporting the formal representation of such clusters. This method allows the formulation of queries which take into consideration the citation behaviour of an author, and predicts future citation behaviours with a good level of accuracy. We evaluate our approach with respect to alternative solutions and discuss the predictive abilities of the identified clusters.
URL: http://oro.open.ac.uk/40784/1/lisc2014.pdf
Basic introduction to recommender systems + Implementing a content-based recommender system by leveraging knowledge encoded into Linked Open Data datasets
Supporting Springer Nature Editors by means of Semantic Technologies
by Francesco Osborne
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) taking informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, which are drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors to associate scholarly metadata to books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed in a specific venue.
The World Wide Web is moving from a Web of hyper-linked documents to a Web of linked data. Thanks to the Semantic Web technological stack and to the more recent Linked Open Data (LOD) initiative, a vast amount of RDF data has been published in freely accessible datasets connected with each other to form the so-called LOD cloud. As of today, we have tons of RDF data available in the Web of Data, but only a few applications really exploit their potential power. The availability of such data is for sure an opportunity to feed personalized information access tools such as recommender systems. We will show how to plug Linked Open Data into a recommendation engine in order to build a new generation of LOD-enabled applications.
(Lecture given @ the 11th Reasoning Web Summer School - Berlin - August 1, 2015)
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G...
by Johann Petrak
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
The ontology engineering research community has focused for many years on supporting the creation, development and evolution of ontologies. Ontology forecasting, which aims at predicting semantic changes in an ontology, represents instead a new challenge. In this paper, we want to give a contribution to this novel endeavour by focusing on the task of forecasting semantic concepts in the research domain. Indeed, ontologies representing scientific disciplines contain only research topics that are already popular enough to be selected by human experts or automatic algorithms. They are thus unfit to support tasks which require the ability of describing and exploring the forefront of research, such as trend detection and horizon scanning. We address this issue by introducing the Semantic Innovation Forecast (SIF) model, which predicts new concepts of an ontology at time t+1 , using only data available at time t. Our approach relies on lexical innovation and adoption information extracted from historical data. We evaluated the SIF model on a very large dataset consisting of over one million scientific papers belonging to the Computer Science domain: the outcomes show that the proposed approach offers a competitive boost in mean average precision-at-ten compared to the baselines when forecasting over 5 years.
The task of keyword extraction is to automatically identify a set of terms that best describe the document. Automatic keyword extraction establishes a foundation for various natural language processing applications: information retrieval, the automatic indexing and classification of documents, automatic summarization and high-level semantic description, etc. Although the keyword extraction applications usually work on single documents (document-oriented task), keyword extraction is also applicable to a more demanding task, i.e. the keyword extraction from a whole collection of documents or from an entire web site, or from tweets from Twitter. In the era of big-data, obtaining an effective and efficient method for automatic keyword extraction from huge amounts of multi-topic textual sources is of high importance.
We proposed a novel Selectivity-Based Keyword Extraction (SBKE) method, which extracts keywords from the source text represented as a network. The node selectivity value is calculated from a weighted network as the average weight distributed on the links of a single node, and is used in the procedure of keyword candidate ranking and extraction. Selectivity slightly outperforms extraction based on standard centrality measures, so selectivity and its modification, generalized selectivity, are included in the SBKE method as node centrality measures. Selectivity-based extraction does not require linguistic knowledge, as it is derived purely from statistical and structural information of the network, and it can be easily ported to new languages and used in a multilingual scenario. The true potential of the proposed SBKE method lies in its generality, portability and low computation costs, which position it as a strong candidate for preparing collections which lack human annotations for keyword extraction. The portability of SBKE was tested on Croatian, Serbian and English texts: the method was developed on Croatian news and ported for extraction from parallel abstracts of scientific publications in Serbian and English.
The constructed parallel corpus of scientific abstracts with annotated keywords allows a better comparison of the method's performance across languages, since we have a controlled experimental environment and data. The achieved keyword extraction results, measured with an F1 score, are 49.57% for English and 46.73% for Serbian if we disregard keywords that are not present in the abstracts. If we evaluate against the whole keyword set, the F1 scores are 40.08% and 45.71%, respectively. This work shows that SBKE can be easily ported to a new language, domain and type of text. Still, there are drawbacks: the method can only extract words that appear in the text.
The Web of Data: do we actually understand what we built?
by Frank van Harmelen
Despite its obvious success (largest knowledge base ever built, used in practice by companies and governments alike), we actually understand very little of the structure of the Web of Data. Its formal meaning is specified in logic, but with its scale, context dependency and dynamics, the Web of Data has outgrown its traditional model-theoretic semantics.
Is the meaning of a logical statement (an edge in the graph) dependent on the cluster ("context") in which it appears? Does a more densely connected concept (node) contain more information? Is the path length between two nodes related to their semantic distance?
Properties such as clustering, connectivity and path length are not described, much less explained by model-theoretic semantics. Do such properties contribute to the meaning of a knowledge graph?
To properly understand the structure and meaning of knowledge graphs, we should no longer treat knowledge graphs as (only) a set of logical statements, but treat them properly as a graph. But how to do this is far from clear.
In this talk, I report on some of our early results on some of these questions, but I ask many more questions for which we don't have answers yet.
Towards a Quality Assessment of Web Corpora for Language Technology Applications
by Marina Santini
In this study, we focus on the creation and evaluation of domain-specific web corpora. To this purpose, we propose a two-step approach, namely (1) the automatic extraction and evaluation of term seeds from personas and use cases/scenarios, and (2) the creation and evaluation of domain-specific web corpora bootstrapped with the term seeds automatically extracted in step 1. Results are encouraging and show that: (1) it is possible to create a fairly accurate term extractor for relatively short narratives; (2) it is straightforward to evaluate a quality such as domain-specificity of web corpora using well-established metrics.
Can We Quantify Domainhood? Exploring Measures to Assess Domain-Specificity i...
by Marina Santini
Web corpora are a cornerstone of modern Language Technology. Corpora built from the web are convenient because their creation is fast and inexpensive. Several studies have been carried out to assess the representativeness of general-purpose web corpora by comparing them to traditional corpora. Less attention has been paid to assessing the representativeness of specialized or domain-specific web corpora. In this paper, we focus on the assessment of domain representativeness of web corpora and claim that it is possible to assess the degree of domain-specificity, or domainhood, of web corpora. We present a case study where we explore the effectiveness of different measures, namely the Mann-Whitney-Wilcoxon test, Kendall correlation coefficient, Kullback-Leibler divergence, log-likelihood and burstiness, to gauge domainhood. Our findings indicate that burstiness is the most suitable measure to single out domain-specific words from a specialized corpus and to allow for the quantification of domainhood.
Graph Analytics (or network analytics) is an area of analysis with numerous applications that draws increasing attention. From fraud detection, money laundering, illegal transactions and other forms of financial crime, to identifying key influencers in social networks, communities of frequently interacting individuals, route optimisation, or even bioinformatics, graph analytics offers a wide variety of constantly evolving solutions, allowing experts in various fields to tackle everyday challenges, extract insights and drive decision making.
Mainly, Graph Analytics is divided into four categories:
* Path Analytics;
* Connectivity Analytics;
* Centrality Analytics; and
* Community Detection Analytics;
each of which relies on different algorithms and addresses different problems.
So, how does one apply these techniques effectively in order to drive hypothesis testing and, eventually, the extraction of actionable insights?
What are the steps that should be followed?
What is the impact of visualisation tools in this process?
How should we sample from graphs?
What tools should one use or be familiar with?
Furthermore, how can scalable, high-quality, production-ready solutions that apply Graph Analytics be implemented, giving direct access to visualisations and on-demand analytics dashboards, which can serve as an intuitive and amenable means of information interpretation?
In this talk, we present and discuss the different categories of graph analytics and their areas of application. In addition, to address the above questions, we define a methodology for the application of these analytics technologies through example use-cases, studying the steps that need to be followed before assumptions can be confirmed and insights can be extracted.
Finally, we will discuss how distributed programming models, such as Spargel, have been developed to allow for the adoption of graph analytics algorithms by frameworks like Apache Spark and Apache Flink; the challenges and limitations that come with their adoption by these frameworks, and how one can build scalable distributed graph analytics solutions using them.
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs t...
by Databricks
It is widely known that the discovery, development, and commercialization of new classes of drugs can take 10-15 years and greater than $5 billion in R&D investment only to see less than 5% of the drugs make it to market.
AstraZeneca is a global, innovation-driven biopharmaceutical business that focuses on the discovery, development, and commercialization of prescription medicines for some of the world’s most serious diseases. Our scientists have been able to improve our success rate over the past 5 years by moving to a data-driven approach (the “5R”) to help develop better drugs faster, choose the right treatment for a patient and run safer clinical trials.
However, our scientists are still unable to make these decisions with all of the available scientific information at their fingertips. Data is sparse across our company as well as external public databases, every new technology requires a different data processing pipeline and new data comes at an increasing pace. It is often repeated that a new scientific paper appears every 30 seconds, which makes it impossible for any individual expert to keep up-to-date with the pace of scientific discovery.
To help our scientists integrate all of this information and make targeted decisions, we have used Spark on Azure Databricks to build a knowledge graph of biological insights and facts. The graph powers a recommendation system which enables any AZ scientist to generate novel target hypotheses, for any disease, leveraging all of our data.
In this talk, I will describe the applications of our knowledge graph and focus on the Spark pipelines we built to quickly assemble and create projections of the graph from 100s of sources. I will also describe the NLP pipelines we have built – leveraging spacy, bioBERT or snorkel – to reliably extract meaningful relations between entities and add them to our knowledge graph.
Graph Techniques for Natural Language Processing
by Sujit Pal
Natural Language embodies the human ability to make “infinite use of finite means” (Humboldt, 1836; Chomsky, 1965). A relatively small number of words can be combined using a grammar in myriad different ways to convey all kinds of information. Languages model inter-relationships between their words, just like graphs model inter-relationships between their vertices. It is not surprising then, that graphs are a natural tool to study Natural Language and glean useful information from it, automatically, and at scale. This presentation will focus on NLP techniques to convert raw text to graphs, and present Graph Theory based solutions to some common NLP problems. Solutions presented will use Apache Spark or Neo4j depending on problem size and scale. Examples of Graph Theory solutions presented include PageRank for Document Summarization, Link Prediction from raw text for Knowledge Graph enhancement, Label Propagation for entity classification, and Random Walk techniques to find similar documents.
New learning technologies seem likely to transform much of science, as they are already doing for many areas of industry and society. We can expect these technologies to be used, for example, to obtain new insights from massive scientific data and to automate research processes. However, success in such endeavors will require new learning systems: scientific computing platforms, methods, and software that enable the large-scale application of learning technologies. These systems will need to enable learning from extremely large quantities of data; the management of large and complex data, models, and workflows; and the delivery of learning capabilities to many thousands of scientists. In this talk, I review these challenges and opportunities and describe systems that my colleagues and I are developing to enable the application of learning throughout the research process, from data acquisition to analysis.
Relationships Matter: Using Connected Data for Better Machine Learning
by Neo4j
Relationships are highly predictive of behavior, yet most data science models overlook this information because it's difficult to extract network structure for use in machine learning (ML).
With graphs, relationships are embedded in the data itself, making it practical to add these predictive capabilities to your existing practices.
That’s why we’re presenting and demoing the use of graph-native ML to make breakthrough predictions. This will cover:
- Different approaches to graph feature engineering, from queries and algorithms to embeddings
- How ML techniques leverage everything from classical network science to deep learning and graph convolutional neural networks
- How to generate representations of your graph using graph embeddings, create ML models for link prediction or node classification, and apply these models to add missing information to an existing graph/incoming data
- Why no-code visualization and prototyping is important
Docker in Open Science Data Analysis Challenges
by Bruce Hoff, Docker, Inc.
Typically in predictive data analysis challenges, participants are provided a dataset and asked to make predictions. Participants include, with their prediction, the scripts/code used to produce it. Challenge administrators validate the winning model by reconstructing and running the source code.
Often data cannot be provided to participants directly, e.g. due to data sensitivity (data may be from living human subjects) or data size (tens of terabytes). Further, predictions must be reproducible from the code provided by participants. Containerization is an excellent solution to these problems: rather than providing the data to the participants, we ask the participants to provide a Dockerized "trainable" model. We run both the training and validation phases of machine learning and guarantee reproducibility 'for free'.
We use the Docker tool suite to spin up and run servers in the cloud to process the queue of submitted containers, each essentially a batch job. This fleet can be scaled to match the level of activity in the challenge. We have used Docker successfully in our 2015 ALS Stratification Challenge and our 2015 Somatic Mutation Calling Tumour Heterogeneity (SMC-HET) Challenge, and are starting an implementation for our 2016 Digital Mammography Challenge.
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
by Spark Summit
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in several ways. The application of Databricks and AWS makes this a scalable implementation; compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. We create a fingerprint for each content item with deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine. Traditional author-disambiguation or record-deduplication algorithms are batch-processing with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback; it is therefore crucial to maintain historical profiles, and we have developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise similarities into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
Keynote given at the workshop for Artificial Intelligence meets the Web of Data on Pragmatic Semantics.
In this keynote I argue that the Web of Data is a Complex System or Marketplace of Ideas rather than a classical Database, and that the model theory on which classical semantics are based is not appropriate in all situations. I propose an alternative "Pragmatic Semantics" based on optimisation of possible interpretations.
An Empirical Study on the Adequacy of Testing in Open Source Projects
by Pavneet Singh Kochhar
In this study, we investigate the state-of-the-practice of testing by measuring code coverage in open-source software projects. We examine over 300 large open-source projects written in Java to measure the code coverage of their associated test cases.
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining
by Mikel Emaldi Manrique
We describe an approach to find similarities between RDF datasets, which may be applicable to tasks such as link discovery, dataset summarization or dataset understanding. Our approach builds on the assumption that similar datasets should have a similar structure and include semantically similar resources and relationships. It is based on the combination of Frequent Subgraph Mining (FSM) techniques, used to synthesize the datasets and find similarities among them. The result of this work can be applied for easing the task of data interlinking and for promoting data reusing in the Semantic Web.
Full paper at: http://memaldi.github.io/pdf/iesd2015.pdf
Sybrandt Thesis Proposal Presentation
1. Exploiting Latent Features of Text and Graphs
Thesis Proposal
Justin Sybrandt
Clemson University
April 30, 2019
2. Peer-Reviewed Work:
• Sybrandt, J., Shtutman, M., & Safro, I. (2017, August). Moliere: Automatic biomedical hypothesis
generation system. In Proceedings of the 23rd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (pp. 1633-1642). ACM.
• Acceptance rate 8.8%.
• 1 Non-author citation.
• Sybrandt, J., Shtutman, M., & Safro, I. (2018, December). Large-scale validation of hypothesis
generation systems via candidate ranking. In 2018 IEEE International Conference on Big Data (Big
Data) (pp. 1494-1503). IEEE.
• Acceptance rate 18%.
• Sybrandt, J., Carrabba, A., Herzog, A., & Safro, I. (2018, December). Are abstracts enough for
hypothesis generation? In 2018 IEEE International Conference on Big Data (Big Data) (pp.
1504-1513). IEEE.
• Acceptance rate 18%.
3. Pending Works:
• Sybrandt, J., & Safro, I. Heterogeneous Bipartite Graph Embeddings. In submission; not available online.
• Sybrandt, J., Shaydulin, R., & Safro, I. Partition Hypergraphs with Embeddings. Not available online.
• Aksenova M., Sybrandt J., Cui B., Sikirzhytski V., Ji H., Odhiambo D., Lucius M., Turner J. R.,
Broude E., Pena E., Lizzaraga S., Zhu J., Safro I., Wyatt M. D., Shtutman M. (2019). Inhibition of
the DDX3 prevents HIV-1 Tat and cocaine-induced neurotoxicity by targeting microglia activation.
https://www.biorxiv.org/content/10.1101/591438v1
• Locke, W., Sybrandt, J., Safro, I., & Atamturktur, S. (2018, November 12). Using Drive-by Health
Monitoring to Detect Bridge Damage Considering Environmental and Operational Effects.
https://doi.org/10.31224/osf.io/ntfdp
4. Other Works:
Peer-Reviewed Extended Abstracts:
• Aksenova, M., Sybrandt, J., Cui, B., Lucius, M., Ji, H., Wyatt, M., Safro, I., Zhu, J., & Shtutman,
M. (2019). Inhibition of the DEAD Box RNA Helicase 3 prevents HIV-1 Tat- and cocaine-induced
neurotoxicity by targeting microglia activation. In 2019 Meeting of the NIDA Genetic Consortium.
Extended Abstract & Poster
• Sybrandt, J., & Hick, J. (2015). Rapid replication of multi-petabyte file systems. Work in progress
in the 2015 Parallel Data Storage Workshop. Poster in 2015 Super Computing.
Online Work:
• Shaydulin, R., & Sybrandt, J. (2017). To Agile, or not to Agile: A Comparison of Software
Development Methodologies. arXiv preprint arXiv:1704.07469.
• 5 Non-author citations.
5. Overview
• Latent Variables
• Unobservable qualities of a dataset.
• Text Embeddings
• Transferable textual latent features.
• Correspond to semantic properties of words.
• Graph Embeddings
• Underlying network features.
• Correspond to roles, communities, and unobserved node-features.
6. Outline
Hypothesis Generation
Moliere: Automatic Biomedical Hypothesis Generation
Validation via Candidate Ranking
Are Abstracts Enough?
Graph Embedding
Heterogeneous Bipartite Graph Embedding
Partition Hypergraphs with Embeddings
Proposed Work
8. Drug Discovery Problem Overview
• Medical research is expensive and risky.
• Text mining can identify fruitful research directions before expensive
experiments.
Wall Street Journal
Pfizer Ends Hunt for Drugs to Treat
Alzheimer’s and Parkinson’s
10. • Automatic Biomedical Hypothesis Generation System
• Basic Pipeline
• Data collection
• Network construction
• Abstract identification
• Topic modeling
11. Data Collection
• Titles & Abstracts
Tumours evade immune control by creating hostile microenvironments that perturb T cell
metabolism and effector function.
• Phrases (n-grams)
• “T cell metabolism”
• Predicates
• tumours → evade → immune control
• Unified Medical Language System
Neoplasms
tumor, tumour, oncological abnormality
12. Medical Text Embedding
• Use fasttext to capture latent features [3].
• Semantically similar items are nearest
neighbors.
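To make this step concrete, here is a minimal sketch of training and querying such an embedding with gensim's FastText implementation (the deck cites fasttext [3]); the toy corpus and all hyperparameters are illustrative, not the values used by Moliere.

```python
from gensim.models import FastText

# Toy stand-in for tokenized PubMed titles and abstracts.
sentences = [
    ["tumours", "evade", "immune", "control"],
    ["t_cell_metabolism", "perturbs", "effector", "function"],
    ["venlafaxine", "interacts", "with", "htr1a"],
]

# Train subword-aware embeddings; dimensions, window, and epochs are illustrative.
model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=10)

# Semantically similar items should appear as nearest neighbors.
print(model.wv.most_similar("tumours", topn=3))
```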
15. Topic Modeling
• Run LDA topic modeling [2].
• Analyze word patterns across topics.
• Example: Venlafaxine interacts with HTR1A.
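A minimal sketch of this topic-modeling step using gensim's LdaModel [2]; the toy documents and topic count are illustrative stand-ins for the abstracts Moliere gathers along a query path.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy stand-in for the abstracts found along a Moliere query path.
docs = [
    ["venlafaxine", "serotonin", "receptor", "htr1a"],
    ["htr1a", "agonist", "binding", "serotonin"],
    ["venlafaxine", "depression", "treatment", "trial"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Fit LDA; the number of topics is illustrative.
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)

# Inspect word patterns across topics.
for topic_id in range(lda.num_topics):
    print(lda.print_topic(topic_id, topn=4))
```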
16. Automating Analysis
• Studying topics manually is infeasible.
• We propose plausibility ranking criteria.
• Drug discovery is a ranking problem.
• Allows for large-scale numerical evaluation.
17. Validation Through Ranking
• Evaluate sorting criteria through ROC.
• “Synthetic” experiment, similar to drug
discovery.
• Identify recent discoveries.
• Create negative samples.
• Propose ranking criteria.
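A small illustration of this evaluation scheme, assuming hypothetical plausibility scores: positives are recent discoveries, negatives are random never-connected pairs, and the area under the ROC curve summarizes how well a criterion ranks the former above the latter.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical plausibility scores: higher should mean "more likely real".
positive_scores = np.array([0.9, 0.8, 0.7, 0.6])  # recent discoveries
negative_scores = np.array([0.5, 0.4, 0.2, 0.1])  # random never-connected pairs

labels = np.concatenate([np.ones(4), np.zeros(4)])
scores = np.concatenate([positive_scores, negative_scores])

# AUC summarizes how well the criterion ranks discoveries above negatives.
print("ROC AUC:", roc_auc_score(labels, scores))
```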
18. Validation Data
• Set “cut-date.”
• Training Data:
• All papers published prior.
• Validation Data:
• Recent discoveries: SemMedDB pairs
first occurring after.
• Negative samples: Random UMLS
pairs never occurring.
19. Proposed Ranking: Embedding
• Measure distance between query terms a and c.
• Calculate weighted centroid for each topic.
• Measure distances between terms and topics.
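A sketch of these embedding-based criteria under hypothetical vectors: cosine similarity between query terms a and c, and similarity between a term and a topic's weighted centroid.

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def topic_centroid(topic, emb):
    """Weighted centroid: mix each topic word's vector by its topic weight."""
    words, weights = zip(*topic.items())
    w = np.array(weights, dtype=float)
    return (w / w.sum()) @ np.stack([emb[x] for x in words])

# Hypothetical embeddings and one LDA topic (word -> weight).
rng = np.random.default_rng(0)
emb = {t: rng.random(100) for t in ["venlafaxine", "htr1a", "serotonin", "receptor"]}
topic = {"serotonin": 0.6, "receptor": 0.4}

print("CSim(a, c):", cosine(emb["venlafaxine"], emb["htr1a"]))
print("term-topic:", cosine(emb["venlafaxine"], topic_centroid(topic, emb)))
```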
20. Proposed Ranking: Topics
• Create nearest-neighbors network from topic embeddings.
• Measure network statistics of shortest path a − c.
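A sketch of the topic-network criteria using networkx; the graph below is a hypothetical nearest-neighbors network over the query terms and topic centroids, with edge weights standing in for embedding distances.

```python
import networkx as nx

# Hypothetical nearest-neighbors network over query terms and topic centroids.
G = nx.Graph()
G.add_weighted_edges_from([
    ("a", "topic_1", 0.3), ("topic_1", "topic_2", 0.2), ("topic_2", "c", 0.4),
    ("a", "topic_3", 0.9), ("topic_3", "c", 0.9),
])

# TopWalkLength-style statistic: shortest a ~ c path through the topic network.
print(nx.shortest_path(G, "a", "c", weight="weight"))
print(nx.shortest_path_length(G, "a", "c", weight="weight"))
```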
21. Combination Ranking
• Embedding measures
• CSim(a, c): Cosine Similarity of Embeddings
• L2(a, c): Euclidean Distance of Embeddings
• BestCentrCSim(a, c, T ): Maximum joint topic similarity.
• BestCentrL2(a, c, T ): Minimum joint topic distance.
• BestTopPerWord(a, c, T ): Max of minimum joint similarity.
• TopicCorr(a, c, T ): Correlation of Topic Distances
• Topic network measures
• TopWalkLength(a, c, T ): Length of shortest path a ∼ c
• TopWalkBtwn(a, c, T ): Avg. a ∼ c betweenness centrality
• TopWalkEigen(a, c, T ): Avg. a ∼ c eigenvalue centrality
• TopNetCCoef(a, c, T ): Clustering coefficient of N
• TopNetMod(a, c, T ): Modularity of N
$\text{PolyMultiple}(a, c, T) = \alpha_1 \cdot L_2(a, c)^{\beta_1} + \alpha_2 \cdot \text{BestCentrL}_2(a, c, T)^{\beta_2} + \alpha_3 \cdot \text{BestTopPerWord}(a, c, T)^{\beta_3} + \alpha_4 \cdot \text{TopicCorr}(a, c, T)^{\beta_4} + \alpha_5 \cdot \text{TopWalkBtwn}(a, c, T)^{\beta_5} + \alpha_6 \cdot \text{TopNetCCoef}(a, c, T)^{\beta_6}$
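A sketch of computing this combined criterion; the metric values and coefficients below are purely illustrative, and the slide does not show how the alpha and beta parameters are fit.

```python
def poly_multiple(metrics, alphas, betas):
    """PolyMultiple: a weighted sum of ranking criteria raised to learned exponents."""
    return float(sum(a * (m ** b) for a, m, b in zip(alphas, metrics, betas)))

# Hypothetical metric values for one (a, c) query, in the slide's order:
# L2, BestCentrL2, BestTopPerWord, TopicCorr, TopWalkBtwn, TopNetCCoef.
metrics = [0.40, 0.30, 0.70, 0.50, 0.20, 0.60]
alphas = [0.2, 0.1, 0.3, 0.15, 0.1, 0.15]  # illustrative coefficients
betas = [1.0] * 6

print("PolyMultiple score:", poly_multiple(metrics, alphas, betas))
```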
23. Testing in the Real World
• Apply ranking criteria for
laboratory experiments.
• HIV-associated Neurodegenerative
Disorder
• 30% of HIV patients over 60 develop dementia.
• We searched 40k gene relationships.
• Identified DDX3X.
27. Experiments
• Create separate instances of Moliere
• From training vector space to Polynomial.
• Datasets
• Iterative halves of PubMed.
• PubMed Central full texts & abstracts.
• Evaluation
• Create shared set of positive and negative hypotheses.
• Cut year of 2015.
• Rank and calculate ROC curves.
28. Dataset Details
Corpus         Total Words    Unique Words  Corpus Size  Median Words per Document
PMC Abstracts    109,987,863      673,389    1,086,704   102
PMC Full-Text  1,860,907,606    6,548,236    1,086,704   1,594
PubMed         1,852,059,044    2,410,130   24,284,910   71
1/2 PubMed       923,679,660    1,505,672   12,142,455   71
1/4 PubMed       460,384,928      920,734    6,071,227   71
1/8 PubMed       229,452,214      565,270    3,035,613   71
1/16 PubMed      114,385,607      349,174    1,517,806   71
29. Abstract vs Full Text Results
• Summarize performance with L2
and Polynomial metrics.
• Polynomial evaluates whole system.
• L2 evaluates embedding.
30. Beyond Quality
• Full text improves performance quality by about 10%.
• Full text increases runtime from 2m to 1.5h.
• Topic modeling:
• Primary runtime increase.
• Less interpretable topics.
31. Effects
• Corpus size:
• Larger corpora help slightly.
• Document length:
• Longer documents significantly improve word embeddings.
• Increase runtime.
• Abstracts vs. Full text
• Content in full texts not found in abstracts.
32. Hypothesis Generation Summary
• Moliere
• Use embedding to make network, find abstracts, perform topic modeling.
• Validation via Candidate Ranking
• Propose metrics to quantify embedding and topic model quality.
• Evaluate recently published results, extend to real-world experiments.
• Are Abstracts Enough?
• Compare performance of system across different corpus qualities.
• Full texts improve performance at large runtime penalty.
35. Skip-Gram Model [17]
• Sample windows centered on target
word.
• Predict leading & trailing context
from target embedding.
• Assumption: “Similar words share
similar company.”
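A minimal skip-gram sketch using the gensim library; the toy corpus and hyperparameters are illustrative, not the settings used in Moliere:

```python
from gensim.models import Word2Vec

# Each document is a list of tokens; sg=1 selects the skip-gram model,
# which predicts the surrounding window from the target word's embedding.
sentences = [["protein", "binds", "receptor"],
             ["gene", "expression", "regulates", "protein"]]
model = Word2Vec(sentences, vector_size=100, window=5, sg=1, min_count=1)
vec = model.wv["protein"]  # the learned 100-dimensional embedding
```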
36. Deepwalk [19]
• Sample random walks from graph.
• Interpret walks as “sentences.”
• Apply Skip-Gram model.
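Those three bullets translate almost directly to code; a sketch with networkx and the skip-gram model above (the function name and walk parameters are mine):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=40):
    """Sample truncated random walks; each walk becomes a 'sentence'."""
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                nbrs = list(graph.neighbors(walk[-1]))
                if not nbrs:
                    break
                walk.append(random.choice(nbrs))
            walks.append([str(n) for n in walk])
    return walks

g = nx.karate_club_graph()
model = Word2Vec(random_walks(g), vector_size=64, window=5, sg=1, min_count=1)
```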
37. LINE [21]
• Sample first- & second-order neighbors.
• Fit observed samples to embeddings.
• Observed probability between u and v:
p(u, v) = w_uv / Σ_{(i,j)∈E} w_ij
• Predicted probability:
p̂(u, v, ε) = σ( ε(u) · ε(v) )
• Minimize the KL-Divergence between p and p̂.
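Since KL(p ∥ p̂) equals −Σ p(u, v) log p̂(u, v) up to a constant, the first-order objective can be sketched directly in numpy (LINE itself optimizes this with negative sampling and SGD; names are mine):

```python
import numpy as np

def line_first_order_loss(emb, edges, weights):
    """-sum over edges of p(u, v) * log sigmoid(emb[u] . emb[v])."""
    w = np.asarray(weights, dtype=float)
    p = w / w.sum()                  # observed edge distribution
    loss = 0.0
    for (u, v), p_uv in zip(edges, p):
        p_hat = 1.0 / (1.0 + np.exp(-(emb[u] @ emb[v])))  # sigmoid
        loss -= p_uv * np.log(p_hat)
    return loss
```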
38. Node2Vec [10]
• Combine structural equivalence and
homophily.
• Blends breadth- and depth-first walks.
• Adds return (p) and in-out (q) parameters.
40. Heterogeneous Bipartite Graphs
• Contains two node types.
• GB = (V, E)
• V = A ∪ B
• A and B are disjoint.
• Neighborhood Γ(i).
• If i ∈ A, then Γ(i) ⊆ B.
41. Proposed Methods
• Boolean Heterogeneous Bipartite
Embedding
• Weight all samples equally.
• Sample direct and first-order
relationships.
• Algebraic Heterogeneous Bipartite
Embedding
• Weight sampling with algebraic
distance.
• Sample direct, first-, and second-order
relationships.
Both Methods:
• Enable type-specific latent features.
• Make only same-type comparisons.
42. Algorithm Outline
• Observed similarities:
• SA(i, j), SB(i, j), SAB(i, j)
• Predicted similarities w.r.t. an embedding ε : V → R^k:
• ŜA(i, j, ε), ŜB(i, j, ε), ŜAB(i, j, ε)
• Optimize:
• Minimize the difference between each S and Ŝ.
48. Algebraic Distance I
• Stationary iterative relaxation.
• Algebraic distance for hypergraphs [20], adapted for bipartite graphs:
a_0(i) ∼ U[0, 1]
a_{t+1}(i) = λ · a_t(i) + (1 − λ) · ( Σ_{j∈Γ(i)} a_t(j) · |Γ(j)|⁻¹ ) / ( Σ_{j∈Γ(i)} |Γ(j)|⁻¹ )
• Between each iteration, rescale to [0, 1].
49. Algebraic Distance II
• Run T = 20 algebraic distance trials until stabilization.
• Summarize distances across all trials:
d(i, j) = √( Σ_{t′=1}^{T} ( a_∞^{(t′)}(i) − a_∞^{(t′)}(j) )² )
• Summarize similarity between nodes:
s(i, j) = ( √T − d(i, j) ) / √T
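A compact numpy sketch of the procedure from the last two slides; a fixed iteration budget stands in for “until stabilization,” and the structure and names are mine:

```python
import numpy as np

def algebraic_coords(adj, lam=0.5, trials=20, iters=50):
    """adj: |A| x |B| 0/1 matrix. Returns (trials x n) coordinates,
    n = |A| + |B|, after the iterative relaxation."""
    n_a, n_b = adj.shape
    full = np.zeros((n_a + n_b, n_a + n_b))   # symmetric adjacency
    full[:n_a, n_a:] = adj
    full[n_a:, :n_a] = adj.T
    inv_deg = 1.0 / np.maximum(full.sum(axis=1), 1.0)
    coords = np.random.rand(trials, n_a + n_b)           # a_0 ~ U[0, 1]
    for _ in range(iters):
        num = (full @ (coords * inv_deg).T).T    # sum_j a_t(j) / |Γ(j)|
        den = np.maximum(full @ inv_deg, 1e-12)  # sum_j 1 / |Γ(j)|
        coords = lam * coords + (1.0 - lam) * num / den
        lo = coords.min(axis=1, keepdims=True)   # rescale each trial
        hi = coords.max(axis=1, keepdims=True)   # back to [0, 1]
        coords = (coords - lo) / np.maximum(hi - lo, 1e-12)
    return coords

def algebraic_sim(coords, i, j):
    """s(i, j) = (sqrt(T) - d(i, j)) / sqrt(T)."""
    d = np.sqrt(np.sum((coords[:, i] - coords[:, j]) ** 2))
    rt = np.sqrt(coords.shape[0])
    return (rt - d) / rt
```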
50. Algebraic Observations
• Same-type:
• Two A nodes are similar if any B node is highly similar to both.
SA(i, j) = SB(i, j) = max_{k∈Γ(i)∩Γ(j)} min( s(i, k), s(k, j) )
• Cross-type, sample both direct and second-order neighbors:
• Decompose cross-type comparisons to same-typed neighborhoods.
SAB(i, j) = max( max_{k∈Γ(j)} SA(i, k), max_{k∈Γ(i)} SB(k, j) )
51. Algebraic Predictions
• Predicted same-type relationships:
ŜA(i, j, ε) = max( 0, ε(i) · ε(j) )
• Predicted same- and cross-typed relationships:
ŜAB(i, j, ε) = E_{k∈Γ(j)}[ ŜA(i, k, ε) ] · E_{k∈Γ(i)}[ ŜB(k, j, ε) ]
52. Algebraic Optimization
• Loss for a particular similarity:
OX = E_{i,j∈V} [ ( SX(i, j) − ŜX(i, j, ε) )² ]
• Optimization:
min OA + OAB + OB
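A sketch of the observed/predicted pair for same-type nodes and the resulting mean-squared loss, reusing s(i, j) from the sketch above; the helper names are mine, and a real implementation would minimize this loss over ε with SGD rather than just evaluating it:

```python
import numpy as np

def observed_same_type(sim, nbrs, i, j):
    """SA(i, j): best shared neighbor, judged by its weaker connection."""
    shared = nbrs[i] & nbrs[j]
    if not shared:
        return 0.0
    return max(min(sim(i, k), sim(k, j)) for k in shared)

def predicted_same_type(emb, i, j):
    """Predicted SA(i, j): max(0, eps(i) . eps(j))."""
    return max(0.0, float(emb[i] @ emb[j]))

def same_type_loss(pairs, emb, sim, nbrs):
    """O_A: mean squared difference between observed and predicted."""
    return float(np.mean([
        (observed_same_type(sim, nbrs, i, j)
         - predicted_same_type(emb, i, j)) ** 2
        for i, j in pairs]))
```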
53. New Embedding Methods
• Boolean Heterogeneous Bipartite
Embedding (BHBE)
• Observes existence of relationships.
• Predicts using σ( ε(i) · ε(j) ).
• Minimizes KL-Divergence.
• Algebraic Heterogeneous Bipartite
Embedding (AHBE)
• Observes relationships weighted
through algebraic distance.
• Predicts using max( 0, ε(i) · ε(j) ).
• Minimizes Mean-Squared Error.
54. Combination Embedding
• Learn joint representation to
combine AHBE & BHBE.
• Direct encoding predicts links through
joint embedding.
• Auto-regularized encoding additionally
enforces that all latent features are
captured.
55. Evaluation I
• Link prediction task.
• Select graph, delete % of edges.
• Embed remaining graph.
• Use embeddings to recover removed edges.
• Explore varying hold-out percentages.
• From 10% to 90% splits.
• Increments of 10%.
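The hold-out protocol itself is short; a sketch with networkx (the function name and toy graph are mine):

```python
import random
import networkx as nx

def link_prediction_split(graph, holdout=0.1, seed=0):
    """Remove a fraction of edges; embeddings are trained on the rest
    and then scored on how well they recover the removed edges."""
    rng = random.Random(seed)
    edges = list(graph.edges())
    rng.shuffle(edges)
    test_edges = edges[:int(holdout * len(edges))]
    train_graph = graph.copy()
    train_graph.remove_edges_from(test_edges)
    return train_graph, test_edges

g = nx.bipartite.random_graph(50, 80, 0.2, seed=1)
for pct in range(10, 100, 10):              # 10% to 90% splits
    train_g, test = link_prediction_split(g, holdout=pct / 100)
    # ...embed train_g, then test recovery of `test` here.
```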
56. Evaluation II
• Models:
• A- and B-Personalized.
• Train an SVM for each node in the training set.
• Each SVM detects an embedding region
containing neighbors.
• Unified.
• Train a shallow neural network.
• Predict links given one embedding of each
type.
Figure: The A-Personalized model.
57. Benchmark
Graph          |A| / |B|        Γ(i ∈ A)        Γ(j ∈ B)        SR      LCP
                                 md     max      md     max
Amazon          16,716 / 5,000    3      49       8     328     75.8     1.6
DBLP            93,432 / 5,000    1      12       8   7,556    174.7    81.7
Friendster     220,015 / 5,000    1      26     133   1,612     80.3    58.3
Livejournal     84,438 / 5,000    1      20      16   1,441    100.9    27.0
MadGrades       11,951 / 6,462    3      39       4     393     57.3    99.7
YouTube         39,841 / 5,000    1      54       4   2,217    113.3    80.6

Table: Graph summary. We report the median (md) and max degree for each node set, as well as the
Spectral Radius (SR) and the percentage of nodes in the largest connected component (LCP).
63. Hypergraph Partitioning: Problem Description I
• Problem: Split V into k disjoint sets...
• of approximately equal size,
• while minimizing an objective over cut hyperedges.
64. Hypergraph Partitioning: Problem Description II
• Partition:
• V = V1 ∪ V2 ∪ · · · ∪ Vk
• ∀ i ≠ j : Vi ∩ Vj = ∅
• Ecut = { e ∈ E : ¬∃ Vi such that e ⊆ Vi }
• Metrics:
λcut := |Ecut|
λ_{k−1} := Σ_{e∈Ecut} ( |{ Vi : Vi ∩ e ≠ ∅ }| − 1 )
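Both metrics follow directly from their definitions; a quick sketch with hyperedges as node sets and a partition as a node-to-block map:

```python
def partition_metrics(hyperedges, part):
    """hyperedges: iterable of node collections; part: dict node -> block.
    Returns (lambda_cut, lambda_{k-1})."""
    lam_cut, lam_km1 = 0, 0
    for e in hyperedges:
        blocks = {part[v] for v in e}
        if len(blocks) > 1:        # e is cut: spans more than one block
            lam_cut += 1
            lam_km1 += len(blocks) - 1
    return lam_cut, lam_km1

# Two of the three hyperedges below are cut by this 2-way partition.
edges = [{1, 2, 3}, {3, 4}, {1, 4}]
part = {1: 0, 2: 0, 3: 0, 4: 1}
print(partition_metrics(edges, part))   # (2, 2)
```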
65. Hypergraph Partitioning: Problem Description III
• Hypergraph partitioning is NP-Hard...
• to solve [15].
• to approximate [5].
66. Multilevel Heuristic
• Steps:
• Coarsen
• Initial Solution
• Uncoarsen: Interpolate &
local search.
• Paradigms:
• (log n)-Level: Each level, pair
almost all nodes.
• n-Level: Each level, pair two
nodes.
67. Coarsening
• Desired coarsening properties [12]:
• Reduce number of nodes & hyperedges.
• Remain structurally similar.
Contribution
Use hypergraph embeddings to better
coarsen nodes.
68. Typical Coarsening
• Assign all nodes & hyperedges uniform weights: w_i = 1.
• Measure similarities (e.g. the hyperedge inner product):
SE(u, v) = Σ_{e∈E : u,v∈e} w_e / ( |e| − 1 )
• Match nodes into (u, v).
• Contract u and v into x.
• wx = wu + wv
• x participates in all hyperedges of u and v.
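The hyperedge inner product above, sketched directly (uniform unit weights by default, matching the first bullet):

```python
def hyperedge_inner_product(hyperedges, u, v, weight=None):
    """S_E(u, v): sum of w_e / (|e| - 1) over hyperedges containing both."""
    total = 0.0
    for e in hyperedges:
        if u in e and v in e and len(e) > 1:
            w_e = weight(e) if weight else 1.0   # uniform by default
            total += w_e / (len(e) - 1)
    return total
```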
69. Embedding-based Coarsening
• Quantify similarity within the embedding:
Sε(u, v) = ε(u) · ε(v)
• Prioritize nodes with highly similar neighbors.
• Measure combined node similarities:
S(u, v) = SE(u, v) · Sε(u, v) / ( w_u · w_v )
70. Embedding-based Coarsening Algorithm
• Sort V by each node’s highest neighbor similarity:
SortingCriterion(u) = max_{v∈Γ(u)} Sε(u, v)
• In sorted order, pair nodes:
Partner(u) = argmax_{v∈Γ(u)} SE(u, v) · Sε(u, v) / ( w_u · w_v )
• Merge (u, v) into coarse node x.
• Assign ε(x) to be the centroid of its contained nodes.
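One greedy matching pass of this algorithm, sketched with numpy; `s_e` stands in for the hyperedge inner product from the earlier sketch, and all other names are mine:

```python
import numpy as np

def embedding_coarsening_pass(nodes, nbrs, emb, s_e, weight):
    """Visit nodes by best embedding similarity; pair each with the
    unmatched neighbor maximizing S_E(u, v) * S_eps(u, v) / (w_u * w_v)."""
    def s_eps(u, v):
        return float(emb[u] @ emb[v])

    order = sorted(nodes,
                   key=lambda u: max((s_eps(u, v) for v in nbrs[u]),
                                     default=0.0),
                   reverse=True)
    matched, pairs = set(), []
    for u in order:
        if u in matched:
            continue
        candidates = [v for v in nbrs[u] if v not in matched and v != u]
        if not candidates:
            continue
        partner = max(candidates,
                      key=lambda v: s_e(u, v) * s_eps(u, v)
                                    / (weight[u] * weight[v]))
        matched.update((u, partner))
        pairs.append((u, partner))
    return pairs
```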
71. Evaluation Method I
• Proposed Implementations:
• Zoltan: (log n)-Level, fast and highly parallel.
• KaHyPar: n-Level, high-quality partitioning.
• KaHyPar Flow: n-Level, best known algorithm.
• Considered Embeddings:
• MetaPath2Vec++
• Node2Vec
• AHBE, BHBE
• Combinations (AHBE+BHBE), (All)
77. Partitioning Summary
• Embeddings improve multilevel hypergraph partitioning.
• Latent features indicate relevant structural similarities.
• Embedding similarity prioritizes and matches nodes.
• Most important for small partition counts (k).
• Embeddings capture key clusters.
• Centroids during coarsening smooth some finer details.
• Latent features are most important for particular graphs.
• Social networks.
• Synthetic graphs.
79. Future Directions
• Bias in Scientific Embeddings
• Detect confirmation bias.
• Normalize effect of “group think.”
• Hybrid Knowledge Graph Mining
• Train text and graph embeddings.
• Formulate hypothesis generation for
deep learning.
80. Bias in Word Embeddings
• Recent work finds gender stereotypes in word embeddings [4].
• Biases exist in science.
• Confirmation bias [18]
• Over-interpreting noise [7].
• P-hacking [11].
• Example: p53
• “[A]valanche of research” [22]
• What connections to focus on? Which are noise?
81. Hybrid Knowledge Graph Mining for Hypothesis Generation
• Knowledge Graphs
• Triplets, similar to SemMedDB Predicates
• Typed relationships
• Specialized Techniques
• Edge2Vec [9].
• Use text to augment graphs [16].
• SciBERT [1].
82. Attention-Based Hypothesis Generation
• Attention mechanisms create interpretable results.
• Assigns weights to relevant input.
• Unified deep-learning model.
• Input: Embeddings for text and network features.
• Potential Outputs:
• Connection strength.
• Connection type.
• Automatic summary.
83. Timeline
Date            Accomplishment
April 2019      Dissertation proposal.
August 2019     Return from summer internship. Begin exploring bias in scientific embeddings.
November 2019   Complete analysis of bias in scientific embeddings. Begin exploring deep learning on knowledge graphs for hypothesis generation.
April 2020      Complete analysis of deep learning and knowledge graphs for hypothesis generation.
June 2020       Dissertation defense.

Table: Timeline of proposed work.
85. Bibliography I
[1] Beltagy, I., Cohan, A., and Lo, K.
Scibert: Pretrained contextualized embeddings for scientific text.
arXiv preprint arXiv:1903.10676 (2019).
[2] Blei, D. M., Ng, A. Y., and Jordan, M. I.
Latent dirichlet allocation.
Journal of Machine Learning Research 3 (2003), 993–1022.
[3] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T.
Enriching word vectors with subword information.
Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[4] Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T.
Man is to computer programmer as woman is to homemaker? debiasing word embeddings.
In Advances in neural information processing systems (2016), pp. 4349–4357.
86. Bibliography II
[5] Bui, T. N., and Jones, C.
Finding good approximate vertex and edge partitions is np-hard.
Information Processing Letters 42, 3 (1992), 153–159.
[6] Çatalyürek, Ü., and Aykanat, C.
Patoh (partitioning tool for hypergraphs).
Encyclopedia of Parallel Computing (2011), 1479–1487.
[7] de Groot, A. D.
The meaning of “significance” for different types of research [translated and annotated by eric-jan
wagenmakers, denny borsboom, josine verhagen, rogier kievit, marjan bakker, angelique cramer,
dora matzke, don mellenbergh, and han lj van der maas].
Acta psychologica 148 (2014), 188–194.
87. Bibliography III
[8] Devine, K. D., Boman, E. G., Heaphy, R. T., Bisseling, R. H., and Çatalyürek,
Ü. V.
Parallel hypergraph partitioning for scientific computing.
In Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International (2006),
IEEE, 10 pp.
[9] Gao, Z., Fu, G., Ouyang, C., Tsutsui, S., Liu, X., and Ding, Y.
edge2vec: Learning node representation using edge semantics.
arXiv preprint arXiv:1809.02269 (2018).
[10] Grover, A., and Leskovec, J.
node2vec: Scalable feature learning for networks.
In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and
data mining (2016), ACM, pp. 855–864.
88. Bibliography IV
[11] Head, M. L., Holman, L., Lanfear, R., Kahn, A. T., and Jennions, M. D.
The extent and consequences of p-hacking in science.
PLoS biology 13, 3 (2015), e1002106.
[12] Heuer, T., Sanders, P., and Schlag, S.
Network Flow-Based Refinement for Multilevel Hypergraph Partitioning.
In 17th International Symposium on Experimental Algorithms (SEA 2018) (2018), pp. 1:1–1:19.
[13] Heuer, T., and Schlag, S.
Improving coarsening schemes for hypergraph partitioning by exploiting community structure.
In 16th International Symposium on Experimental Algorithms, (SEA 2017) (2017),
pp. 21:1–21:19.
[14] Karypis, G.
hmetis 1.5: A hypergraph partitioning package.
http://www.cs.umn.edu/~metis (1998).
89. Bibliography V
[15] Lengauer, T.
Combinatorial algorithms for integrated circuit layout.
Springer Science & Business Media, 2012.
[16] Li, C., Lai, Y.-Y., Neville, J., and Goldwasser, D.
Joint embedding models for textual and social analysis.
In The Workshop on Deep Structured Prediction (2017).
[17] Mikolov, T., Chen, K., Corrado, G., and Dean, J.
Efficient estimation of word representations in vector space.
arXiv preprint arXiv:1301.3781 (2013).
90. Bibliography VI
[18] Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D.,
Du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., and Ioannidis,
J. P.
A manifesto for reproducible science.
Nature human behaviour 1, 1 (2017), 0021.
[19] Perozzi, B., Al-Rfou, R., and Skiena, S.
Deepwalk: Online learning of social representations.
In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining (2014), ACM, pp. 701–710.
[20] Shaydulin, R., Chen, J., and Safro, I.
Relaxation-based coarsening for multilevel hypergraph partitioning.
Multiscale Modeling & Simulation 17, 1 (2019), 482–506.
91. Bibliography VII
[21] Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q.
Line: Large-scale information network embedding.
In Proceedings of the 24th International Conference on World Wide Web (2015), International
World Wide Web Conferences Steering Committee, pp. 1067–1077.
[22] Vogelstein, B., Lane, D., and Levine, A. J.
Surfing the p53 network.
Nature 408, 6810 (2000), 307.
92. Outline
Hypothesis Generation
Moliere: Automatic Biomedical Hypothesis Generation
Validation via Candidate Ranking
Are Abstracts Enough?
Graph Embedding
Heterogeneous Bipartite Graph Embedding
Partition Hypergraphs with Embeddings
Proposed Work