Natural Language embodies the human ability to make “infinite use of finite means” (Humboldt, 1836; Chomsky, 1965). A relatively small number of words can be combined using a grammar in myriad ways to convey all kinds of information. Languages model inter-relationships between their words, just as graphs model inter-relationships between their vertices. It is not surprising, then, that graphs are a natural tool for studying Natural Language and gleaning useful information from it, automatically and at scale. This presentation will focus on NLP techniques to convert raw text to graphs, and present Graph Theory based solutions to some common NLP problems. Solutions presented will use Apache Spark or Neo4j depending on problem size and scale. Examples of Graph Theory solutions presented include PageRank for Document Summarization, Link Prediction from raw text for Knowledge Graph enhancement, Label Propagation for entity classification, and Random Walk techniques to find similar documents.
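As a concrete illustration of the PageRank-for-summarization idea mentioned above, here is a minimal single-machine sketch in the TextRank style, assuming TF-IDF cosine similarity between sentences; the talk itself targets Apache Spark or Neo4j at scale, so this toy version is illustrative only.

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def summarize(sentences, num_sentences=3):
    # Nodes are sentences; edge weights are TF-IDF cosine similarities.
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)
    graph = nx.from_numpy_array(sim)
    # Rank sentences by PageRank centrality over the similarity graph.
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    # Return the top-ranked sentences in their original order.
    return [sentences[i] for i in sorted(top)]
```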
This report discusses three submissions based on the Duet architecture to the Deep Learning track at TREC 2019. For the document retrieval task, we adapt the Duet model to ingest a "multiple field" view of documents—we refer to the new architecture as Duet with Multiple Fields (DuetMF). A second submission combines the DuetMF model with other neural and traditional relevance estimators in a learning-to-rank framework and achieves improved performance over the DuetMF baseline. For the passage retrieval task, we submit a single run based on an ensemble of eight Duet models.
A fundamental goal of search engines is to identify, given a query, documents that have relevant text. This is intrinsically difficult because the query and the document may use different vocabulary, or the document may contain query words without being relevant. We investigate neural word embeddings as a source of evidence in document ranking. We train a word2vec embedding model on a large unlabelled query corpus, but in contrast to how the model is commonly used, we retain both the input and the output projections, allowing us to leverage both the embedding spaces to derive richer distributional relationships. During ranking we map the query words into the input space and the document words into the output space, and compute a query-document relevance score by aggregating the cosine similarities across all the query-document word pairs.
We postulate that the proposed Dual Embedding Space Model (DESM) captures evidence on whether a document is about a query term in addition to what is modelled by traditional term-frequency based approaches. Our experiments show that the DESM can re-rank top documents returned by a commercial Web search engine, like Bing, better than a term-matching based signal like TF-IDF. However, when ranking a larger set of candidate documents, we find the embeddings-based approach is prone to false positives, retrieving documents that are only loosely related to the query. We demonstrate that this problem can be solved effectively by ranking based on a linear mixture of the DESM and the word counting features.
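As a hedged illustration of the scoring step described above: assuming `in_emb` and `out_emb` are word-to-vector mappings for the retained input and output projections, one common formulation compares each query word in the IN space against the centroid of the normalized document word vectors in the OUT space.

```python
import numpy as np

def desm_score(query_terms, doc_terms, in_emb, out_emb):
    # Query words are mapped into the IN (input projection) space.
    q = np.stack([in_emb[w] for w in query_terms if w in in_emb])
    q /= np.linalg.norm(q, axis=1, keepdims=True)
    # Document words are mapped into the OUT (output projection) space,
    # normalized, and reduced to a single centroid vector.
    d = np.stack([out_emb[w] for w in doc_terms if w in out_emb])
    d /= np.linalg.norm(d, axis=1, keepdims=True)
    d_bar = d.mean(axis=0)
    d_bar /= np.linalg.norm(d_bar)
    # Average cosine similarity of each query word against the centroid.
    return float(q.dot(d_bar).mean())
```

The final ranking signal can then be a linear mixture, e.g. a score like `lam * desm + (1 - lam) * bm25`, along the lines of the DESM-plus-word-counting combination reported above.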
Adversarial and reinforcement learning-based approaches to information retrieval (Bhaskar Mitra)
Traditionally, machine learning based approaches to information retrieval have taken the form of supervised learning-to-rank models. Recently, other machine learning approaches—such as adversarial learning and reinforcement learning—have started to find interesting applications in retrieval systems. At Bing, we have been exploring some of these methods in the context of web search. In this talk, I will share a couple of our recent works in this area that we presented at SIGIR 2018.
5 Lessons Learned from Designing Neural Models for Information Retrieval (Bhaskar Mitra)
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles is emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate the application of neural approaches to IR that go beyond text matching.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
Exploring Session Context using Distributed Representations of Queries and Reformulations (Bhaskar Mitra)
Search logs contain examples of frequently occurring patterns of user reformulations of queries. Intuitively, the reformulation "san francisco" → "san francisco 49ers" is semantically similar to "detroit" →"detroit lions". Likewise, "london"→"things to do in london" and "new york"→"new york tourist attractions" can also be considered similar transitions in intent. The reformulation "movies" → "new movies" and "york" → "new york", however, are clearly different despite the lexical similarities in the two reformulations. In this paper, we study the distributed representation of queries learnt by deep neural network models, such as the Convolutional Latent Semantic Model, and show that they can be used to represent query reformulations as vectors. These reformulation vectors exhibit favourable properties such as mapping semantically and syntactically similar query changes closer in the embedding space. Our work is motivated by the success of continuous space language models in capturing relationships between words and their meanings using offset vectors. We demonstrate a way to extend the same intuition to represent query reformulations.
Furthermore, we show that the distributed representations of queries and reformulations are both useful for modelling session context for query prediction tasks, such as query auto-completion (QAC) ranking. Our empirical study demonstrates that short-term (session) history context features based on these two representations improve the mean reciprocal rank (MRR) for the QAC ranking task by more than 10% over a supervised ranker baseline. Our results also show that using features based on both these representations together achieves better performance than either of them individually.
Paper: http://research.microsoft.com/apps/pubs/default.aspx?id=244728
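To illustrate the offset-vector idea from the abstract, a reformulation can be represented as the difference between the embeddings of the two queries; the sketch below assumes some query encoder `embed(query) -> np.ndarray` (e.g. a CLSM-style model, which is not reproduced here).

```python
import numpy as np

def reformulation_vector(q_before, q_after, embed):
    # A reformulation is the offset between the two query embeddings.
    return embed(q_after) - embed(q_before)

def cosine(a, b):
    return float(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar reformulations should have nearby offset vectors,
# e.g. a high value for:
# cosine(reformulation_vector("san francisco", "san francisco 49ers", embed),
#        reformulation_vector("detroit", "detroit lions", embed))
```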
ICFHR 2014 Competition on Handwritten KeyWord Spotting (H-KWS 2014) (Konstantinos Zagoris)
H-KWS 2014 is the Handwritten Keyword Spotting Competition organized in conjunction with the ICFHR 2014 conference. The main objective of the competition is to record current advances in keyword spotting algorithms using established performance evaluation measures frequently encountered in the information retrieval literature. The competition comprises two distinct tracks, namely a segmentation-based and a segmentation-free track. Five (5) distinct research groups participated in the competition, with three (3) methods for the segmentation-based track and four (4) methods for the segmentation-free track. The benchmarking datasets used in the contest contain both historical and modern documents from multiple writers. In this paper, the contest details are reported, including the evaluation measures and the performance of the submitted methods, along with a short description of each method.
Automatic Personality Prediction with Attention-based Neural Networks (Jinho Choi)
Distributional Semantic word representation allows Natural Language Processing systems to extract and model an immense amount of information about a language. This technique maps words into a high dimensional continuous space through the use of a single-layer neural network. This process has allowed for advances in many Natural Language Processing research areas and tasks. These representation models are evaluated with analogy tests, in which questions of the form "If a is to a', then b is to what?" are answered by composing multiple word vectors and searching the vector space. During the neural network training process, each word is examined as a member of its context. Generally, a word's context is considered to be the elements adjacent to it within a sentence. While some work has been conducted examining the effect of expanding this definition, very little exploration has been done in this area. Further, no inquiry has been conducted into the specific linguistic competencies of these models or whether modifying their contexts impacts the information they extract. In this paper we propose a thorough analysis of the various lexical and grammatical competencies of distributional semantic models. We aim to leverage analogy tests to evaluate the most advanced distributional model across 14 different types of linguistic relationships. With this information we will then be able to investigate whether modifying the training context renders any differences in quality across any of these categories. Ideally we will be able to identify approaches to training that increase precision in some specific linguistic categories, which will allow us to investigate whether these improvements can be combined by joining the information used in different training approaches to build a single, improved model.
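For reference, this is how such an analogy test is commonly run with gensim word vectors; the model path is a placeholder.

```python
from gensim.models import KeyedVectors

# Placeholder path to pre-trained word vectors in word2vec format.
vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

# "If man is to king, then woman is to what?" is answered by composing
# vectors and searching the space; a competent model ranks "queen" first.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```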
Deep Learning approaches for Hate speech detection. In this work we used two deep learning approaches, DCNN and MLP, as two separate classifiers on four publicly available datasets.
The session focused on Data Mining using the R language, where I analyzed a large volume of text files to extract meaningful insights using concepts like DocumentTermMatrix and WordCloud.
Author Identification of Source Code Segments Written by Multiple Authors Usi... (Parvez Mahbub)
Authors:
Parvez Mahbub, Department of Computer Science, Dalhousie University (parvezmrobin@dal.ca)
Naz Zarreen Oishie, Department of Computer Science, University of Saskatchewan (naz.oishie@usask.ca)
S.M. Rafizul Haque, CSE Discipline, Khulna University (rafizul@cse.ku.ac.bd)
AOTO: Adaptive overlay topology optimization in unstructured P2P systems (Zhenyun Zhuang)
IEEE GLOBECOM 2003
Peer-to-Peer (P2P) systems are self-organized and decentralized. However, the mechanism of a peer randomly joining and leaving a P2P network causes topology mismatching between the P2P logical overlay network and the physical underlying network. The topology mismatching problem brings great stress on the Internet infrastructure and seriously limits the performance gain from various search or routing techniques. We propose the Adaptive Overlay Topology Optimization (AOTO) technique, an algorithm of building an overlay multicast tree among each source node and its direct logical neighbors so as to alleviate the mismatching problem by choosing closer nodes as logical neighbors, while providing a larger query coverage range. AOTO is scalable and completely distributed in the sense that it does not require global knowledge of the whole overlay network when each node is optimizing the organization of its logical neighbors. The simulation shows that AOTO can effectively solve the mismatching problem and reduce more than 55% of the traffic generated by the P2P system itself.
We use metadata of various kinds to improve and enrich text document clustering, using an extension of Latent Dirichlet Allocation (LDA). The methods are fully implemented and evaluated, and the software is available on GitHub.
These are the slides of an invited talk I gave September 8 at the Alexandria Workshop of TPDL-2016: http://alexandria-project.eu/events/3rd-workshop/
With the rise of Web 2.0, API-based software has appeared. This article examines the API-based search tool created for the Korean search engine Naver: WeboNaver (Webometrics Tool for Naver). The software is able to collect large amounts of data automatically and can easily distinguish between different types of information on the web, which was impossible before. In particular, Internet researchers can improve the efficiency of data analysis within a specified timeframe using this tool. This paper illustrates how to use WeboNaver and verifies its usability and reliability through several case studies. In this article, the web presence of Korean National Assembly Members was analyzed, as was the web presence of the term H1N1.
With the advent of Web 2.0 and the emergence of software programs built on Open APIs, users no longer need to manually search the web and sift through information one item at a time. Using public APIs, vast amounts of data can be systematically collected and managed with a few simple operations. This paper introduces WeboNaver (Webometrics Tool for Naver), a specialized search program developed using Open APIs. It automatically collects and stores large volumes of data by category from Naver, one of the most influential search engines in Korea. Researchers can use it to bring accuracy and a high degree of efficiency to data management, processing, and analysis. The purpose of this paper is to demonstrate the tool's usefulness by presenting concrete analysis procedures through real case studies, to help students, general users, and researchers who wish to use WeboNaver. Using the program, the web visibility of 292 members of the 18th National Assembly was examined, as was the web visibility of terms related to H1N1 (swine flu).
Network Visualization guest lecture at #DataVizQMSS at @Columbia / #SNA at PU... (Denis Parra Santander)
- The first version was a guest lecture about Network Visualization in the class "Data Visualization" taught by Dr. Sharon Hsiao in the QMSS program at Columbia University: http://www.columbia.edu/~ih2240/dataviz/index.htm
- This updated version was delivered in our class on SNA at PUC Chile in the MPGI master program.
Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Q... (Databricks)
In real-life applications, we often deal with situations where analysis needs to be conducted on graphs whose nodes and edges are associated with multiple labels. For example, in a graph that represents user activities in social networks, the labels associated with nodes may indicate their membership in communities (e.g. group, school, company, etc.), and the labels associated with edges may denote types of activities (e.g. comment, like, share, etc.). The current GraphX library in Spark does not directly support efficient label-defined subgraph analysis and computations.
In this session, the speakers will propose a general API library that is able to support analysis on multi-label graphs, and can be reused and extended to design more complicated algorithms. It includes a method to create multi-label graphs and calculate basic statistics and metrics at both the global and subgraph level. Common graph algorithms, such as PageRank, can also be efficiently implemented in a parallel scheme by reusing the module/algorithm in GraphX, such as Pregel API.
See how LinkedIn is able to leverage this tool to efficiently find top LinkedIn feed influencers in different communities and by different actions.
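A hedged PySpark sketch of the kind of multi-label subgraph workflow described above, using the open-source GraphFrames package; the column names and filtering approach are illustrative, not the speakers' actual library, and a `spark` session is assumed to exist.

```python
from graphframes import GraphFrame

# `spark` is an existing SparkSession.
# Vertices carry a community membership label; edges carry an activity label.
vertices = spark.createDataFrame(
    [("u1", "school"), ("u2", "company"), ("u3", "school")],
    ["id", "community"])
edges = spark.createDataFrame(
    [("u1", "u2", "like"), ("u2", "u3", "share"), ("u3", "u1", "like")],
    ["src", "dst", "action"])
g = GraphFrame(vertices, edges)

# Restrict to a label-defined subgraph, then reuse a stock algorithm on it.
likes = g.filterEdges("action = 'like'").dropIsolatedVertices()
ranks = likes.pageRank(resetProbability=0.15, maxIter=10)
ranks.vertices.select("id", "pagerank").show()
```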
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in three ways:
- Application of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
- We create a fingerprint for each content item by deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine designed.
- Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles, and we have developed a machine learning implementation to deal with data streams and process them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise similarities into final clusters.
Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
Detection of Related Semantic Datasets Based on Frequent Subgraph Mining (Mikel Emaldi Manrique)
We describe an approach to find similarities between RDF datasets, which may be applicable to tasks such as link discovery, dataset summarization, or dataset understanding. Our approach builds on the assumption that similar datasets should have a similar structure and include semantically similar resources and relationships. It is based on the combination of Frequent Subgraph Mining (FSM) techniques, used to synthesize the datasets and find similarities among them. The results of this work can be applied to ease the task of data interlinking and to promote data reuse in the Semantic Web.
Full paper at: http://memaldi.github.io/pdf/iesd2015.pdf
Hierarchical clustering in Python and beyond (Frank Kelly)
Clustering of data is an increasingly important task for many data scientists. This talk will explore the challenge of hierarchical clustering of text data for summarisation purposes. We'll take a look at some great solutions now available to Python users, including the relevant scikit-learn libraries and Elasticsearch (with the carrot2 plugin), and check out visualisations from both approaches.
https://www.youtube.com/watch?v=KFs9pBAetOo
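As a small taste of the talk's subject, here is a minimal sketch of hierarchical clustering of text in Python with scikit-learn and SciPy, on toy data.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr", "kittens purr softly", "dogs bark", "puppies bark loudly"]
X = TfidfVectorizer().fit_transform(docs).toarray()

Z = linkage(X, method="ward")                    # agglomerative clustering
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(list(zip(docs, labels)))

dendrogram(Z, labels=docs)                       # visualise the hierarchy
plt.show()
```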
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum... (Spark Summit)
In this talk we evaluate Apache Spark for a data-intensive machine learning problem. Our use case focuses on policy diffusion detection across the state legislatures in the United States over time. Previous work on policy diffusion has been unable to make an all-pairs comparison between bills due to computational intensity. As a substitute, scholars have studied single topic areas.
We provide an implementation of this analysis workflow as a distributed text processing pipeline with Spark ML and GraphFrames.
The Histogrammar package—a cross-platform suite of data aggregation primitives for making histograms, calculating descriptive statistics, and plotting in Scala—is introduced to enable interactive data analysis in the Spark REPL.
We discuss the challenges and strategies of unstructured data processing, data formats for storage and efficient access, and graph processing at scale.
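For flavor, a hedged Spark ML sketch of the all-pairs text similarity step such a pipeline might use: tokenize, hash to sparse feature vectors, and apply MinHash LSH so the bill-to-bill comparison stays tractable. The `bills` DataFrame, column names, and threshold are illustrative assumptions, not the authors' exact pipeline.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH, RegexTokenizer

# `bills` is assumed to be a DataFrame of legislative texts with a `text` column.
pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="text", outputCol="words", pattern="\\W+"),
    HashingTF(inputCol="words", outputCol="features", numFeatures=1 << 18),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
])
model = pipeline.fit(bills)
hashed = model.transform(bills)

# Approximate all-pairs join: candidate bill pairs within a Jaccard distance.
pairs = model.stages[-1].approxSimilarityJoin(hashed, hashed, 0.6, distCol="dist")
pairs.show()
```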
PEARC17: A real-time machine learning and visualization framework for scientif... (Feng Li)
High-performance computing resources are currently widely used in science and engineering areas. Typical post-hoc approaches use persistent storage to save data produced by simulations, so reading from storage into memory is required for data analysis tasks. For large-scale scientific simulations, such I/O operations produce significant overhead. In-situ/in-transit approaches bypass I/O by accessing and processing in-memory simulation results directly, which suggests simulations and analysis applications should be more closely coupled. This paper constructs a flexible and extensible framework to connect scientific simulations with multi-step machine learning processes and in-situ visualization tools, thus providing plugged-in analysis and visualization functionality over complex workflows in real time. A distributed simulation-time clustering method is proposed to detect anomalies in real turbulence flows.
Keynote given at the workshop for Artificial Intelligence meets the Web of Data on Pragmatic Semantics.
In this keynote I argue that the Web of Data is a Complex System or Marketplace of Ideas rather than a classical Database, and that the model theory on which classical semantics is based is not appropriate in all situations, and I propose an alternative "Pragmatic Semantics" based on optimisation of possible interpretations.
Transforming AI with Graphs: Real World Examples using Spark and Neo4j (Databricks)
Graphs – or information about the relationships, connections, and topology of data points – are transforming machine learning. We'll walk through real world examples of how to transform your tabular data into a graph and how to get started with graph AI. This talk will provide an overview of how to incorporate graph based features into traditional machine learning pipelines, create graph embeddings to better describe your graph topology, and give you a preview of approaches for graph native learning using graph neural networks. We'll talk about relevant, real world case studies in financial crime detection, recommendations, and drug discovery. This talk is intended to introduce the concept of graph based AI to beginners, as well as help practitioners understand new techniques and applications. Key takeaways: how graph data can improve machine learning, when graphs are relevant to data science applications, what graph native learning is, and how to get started.
Supporting Concept Search using a Clinical Healthcare Knowledge Graph (Sujit Pal)
We describe our Dictionary based Named Entity Recognizer and Semantic Matcher, which enable us to leverage our Knowledge Graph to provide Concept Search. We also describe our Named Entity Linking based Concept Recommender to support manual curation of our Knowledge Graph.
Youtube URL for talk: https://youtu.be/5UWrS_j8dDg
Google AI Hackathon: LLM based Evaluator for RAG (Sujit Pal)
Slides accompanying the project submission video for the Google AI Hackathon. Describes an LCEL and DSPy based evaluation framework inspired by the RAGAS project.
Accompanying video URL: https://youtu.be/yOIU65chc98
Building Learning to Rank (LTR) search reranking models using Large Language ... (Sujit Pal)
Search engineers have many tools to address relevance. Older tools are typically unsupervised (statistical, rule based) and require large investments in manual tuning effort. Newer ones involve training or fine-tuning machine learning models and vector search, which require large investments in labeling documents with their relevance to queries.
Learning to Rank (LTR) models are in the latter category. However, their popularity has traditionally been limited to domains where user data can be harnessed to generate labels that are cheap and plentiful, such as e-commerce sites. In domains where this is not true, labeling often involves human experts, and results in labels that are neither cheap nor plentiful. This effectively becomes a roadblock to adoption of LTR models in these domains, in spite of their effectiveness in general.
Generative Large Language Models (LLMs) with parameters in the 70B+ range have been found to perform well at tasks that require mimicking human preferences. Labeling query-document pairs with relevance judgements for training LTR models is one such task. Using LLMs for this task opens up the possibility of obtaining a potentially unlimited number of query judgment labels, and makes LTR models a viable approach to improving the site’s search relevancy.
In this presentation, we describe work that was done to train and evaluate four LTR based re-rankers against lexical, vector, and heuristic search baselines. The models were a mix of pointwise, pairwise and listwise, and required different strategies to generate labels for them. All four models outperformed the lexical baseline, and one of the four models outperformed the vector search baseline as well. None of the models beat the heuristics baseline, although two came close – however, it is important to note that the heuristics were built up over months of trial and error and required familiarity with the search domain, whereas the LTR models were built in days and required much less familiarity.
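An illustrative sketch of the LLM labeling step, assuming the OpenAI chat completions API; the model name, prompt, and 0-3 label scale are placeholders rather than the setup used in the work described above.

```python
from openai import OpenAI

client = OpenAI()

def judge(query: str, document: str) -> int:
    prompt = (
        "On a scale of 0 (not relevant) to 3 (perfectly relevant), rate how "
        f"relevant this document is to the query.\nQuery: {query}\n"
        f"Document: {document}\nAnswer with a single digit."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; the work above used 70B+ class models
        messages=[{"role": "user", "content": prompt}],
    )
    # Parse the graded relevance label out of the completion.
    return int(resp.choices[0].message.content.strip()[0])
```

Labels produced this way can feed a pointwise LTR model directly, or be converted into preference pairs or list orderings for pairwise and listwise models.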
The ability to handle long question style queries is often de rigueur for modern search engines. Search giants such as Bing and Google are addressing this by building Large Language Models (LLMs) into their search pipelines. Unfortunately, this approach requires large investments in infrastructure and involves high operational costs. It can also lead to loss of confidence when the LLM hallucinates non-factual answers.
A best practice for designing search pipelines is to make the search layer as cheap and fast as possible, and move heavyweight operations into the indexing layer. With that in mind, we present an approach that combines the use of LLMs during indexing to generate questions from passages, and matching them to incoming questions during search, using either text based or vector based matching. We believe this approach can provide good quality question answering capabilities for search applications and address the cost and confidence issues mentioned above.
Vector search goes far beyond just text, and, in this interactive workshop, you will learn how to use it for multimodal search through an in-depth look at CLIP, a vision and language model developed by OpenAI. Sujit Pal, technology research director at Elsevier, and Raphael Pisoni, senior computer vision engineer at Partium.io, will walk you through two applications of image search and then have a panel discussion with our staff developer advocate, James, on how to use CLIP for image and text search.
Learning a Joint Embedding Representation for Image Search using Self-supervi... (Sujit Pal)
Image search interfaces either prompt the searcher to provide a search image (image-to-image search) or a text description of the image (text-to-image search). Image to Image search is generally implemented as a nearest neighbor search in a dense image embedding space, where the embedding is derived from Neural Networks pre-trained on a large image corpus such as ImageNet. Text to image search can be implemented via traditional (TF/IDF or BM25 based) text search against image captions or image tags.
In this presentation, we describe how we fine-tuned the OpenAI CLIP model (available from Hugging Face) to learn a joint image/text embedding representation from naturally occurring image-caption pairs in literature, using contrastive learning. We then show this model in action against a dataset of medical image-caption pairs, using the Vespa search engine to support text based (BM25), vector based (ANN) and hybrid text-to-image and image-to-image search.
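A hedged sketch of joint-space scoring with the Hugging Face CLIP model mentioned above; the fine-tuning loop, medical dataset, and Vespa indexing are omitted, and the image path and captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("figure.png")  # placeholder image path
captions = ["chest x-ray", "histology slide", "brain mri"]

# Encode image and captions into the joint embedding space and score them.
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
print(out.logits_per_image.softmax(dim=-1))  # similarity over the captions
```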
The power of community: training a Transformer Language Model on a shoestring (Sujit Pal)
I recently participated in a community event to train an ALBERT language model for the Bengali language. The event was organized by Neuropark, Hugging Face, and Yandex Research. The training was done collaboratively in a distributed manner using free GPU resources provided by Colab and Kaggle. Volunteers were recruited on Twitter and project coordination happened on Discord. At its peak, there were approximately 50 volunteers from all over the world simultaneously engaged in training the model. The distributed training was done on the Hivemind platform from Yandex Research, and the software to train the model in a data-parallel manner was developed by Hugging Face. In this talk I provide my perspective of the project as a somewhat curious participant. I will describe the Hivemind platform, the training regimen, and the evaluation of the language model on downstream tasks. I will also cover some challenges we encountered that were peculiar to the Bengali language (and Indic languages in general).
Accelerating NLP with Dask and Saturn Cloud (Sujit Pal)
Slides for talk delivered at NY NLP Meetup. Abstract -- Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grows. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines. This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. The pipeline was built and executed on Saturn Cloud, a platform that makes it easy to launch and manage Dask clusters. The talk will present an introduction to Dask and explain how users can easily accelerate Python and NLP code across clusters of machines.
Accelerating NLP with Dask on Saturn Cloud: A case study with CORD-19 (Sujit Pal)
Python has a great ecosystem of tools for natural language processing (NLP) pipelines, but challenges arise when data sizes and computational complexity grows. Best case, a pipeline is left to run overnight or even over several days. Worst case, certain analyses or computations are just not possible. Dask is a Python-native parallel processing tool that enables Python users to easily scale their code across a cluster of machines.
This talk presents an example of an NLP entity extraction pipeline using SciSpacy with Dask for parallelization, which was built and executed on Saturn Cloud. Saturn Cloud is an end-to-end data science and machine learning platform that provides an easy interface for Python environments and Dask clusters, removing many barriers to accessing parallel computing. This pipeline extracts named entities from the CORD-19 dataset, using trained models from the SciSpaCy project, and makes them available for downstream tasks in the form of structured Parquet files. We will provide an introduction to Dask and Saturn Cloud, then walk through the NLP code.
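A minimal sketch of the parallelization pattern described in the talks, assuming a SciSpaCy model is installed; the model name and file paths are placeholders, and the Saturn Cloud cluster setup is omitted.

```python
import dask.bag as db
import spacy

def extract_entities(texts):
    # Load the model once per partition, not once per document.
    nlp = spacy.load("en_core_sci_sm")  # SciSpaCy model, assumed installed
    records = []
    for doc in nlp.pipe(texts):
        records.extend({"entity": e.text, "label": e.label_} for e in doc.ents)
    return records

bag = db.read_text("cord19/*.txt")               # one text per line (placeholder)
entities = bag.map_partitions(extract_entities)  # runs in parallel on the cluster
entities.to_dataframe().to_parquet("entities.parquet")  # structured output
```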
Leslie Smith's Papers discussion for DL Journal Club (Sujit Pal)
This deck discusses two papers by Dr Leslie Smith. The first paper discusses empirical findings around learning rate (LR) and other regularization parameters for neural networks, and leads to the idea of Cyclic Learning Rates (CLR). The second paper discusses CLR in depth, as well as how to estimate its parameters. The slides also cover LR Finder, a tool first introduced in the Fast.AI library to find optimal parameters for CLR, including how to run it and interpret its outputs.
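A short example of the cyclical learning rate policy using PyTorch's built-in scheduler; the model is a stand-in and the bounds are placeholders that would normally come from an LR Finder (range test) run.

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CyclicLR(
    optimizer,
    base_lr=1e-4,       # lower bound, typically from the LR range test
    max_lr=1e-2,        # upper bound, typically from the LR range test
    step_size_up=2000,  # iterations per half-cycle
)

for step in range(10):  # stand-in for the real training loop
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()    # advance the cyclic schedule once per batch
```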
Using Graph and Transformer Embeddings for Vector Based Retrieval (Sujit Pal)
For the longest time, term-based vector representations based on whole-document statistics, such as TF-IDF, have been the staple of efficient and effective information retrieval. The popularity of Deep Learning over the past decade has resulted in the development of many interesting embedding schemes. Like term-based vector representations, these embeddings depend on structure implicit in language and user behavior. Unlike them, they leverage the distributional hypothesis, which states that the meaning of a word is determined by the context in which it appears. These embeddings have been found to better encode the semantics of the word, compared to term-based representations. Despite this, it has only recently become practical to use embeddings in Information Retrieval at scale.
In this presentation, we will describe how we applied two new embedding schemes to Scopus, Elsevier’s broad coverage database of scientific, technical, and medical literature. Both schemes are based on the distributional hypothesis but come from very different backgrounds. The first embedding is a graph embedding called node2vec, that encodes papers using citation relationships between them as specified by their authors. The second embedding leverages Transformers, a recent innovation in the area of Deep Learning, that are essentially language models trained on large bodies of text. These two embeddings exploit the signal implicit in these data sources and produce semantically rich user and content-based vector representations respectively. We will evaluate these embedding schemes and describe how we used the Vespa search engine to search these embeddings for similar documents within the Scopus dataset. Finally, we will describe how RELX staff can access these embeddings for their own data science needs, independent of the search application.
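A hedged sketch of the citation-graph embedding step using the open-source `node2vec` package over networkx; a toy graph stands in for the Scopus citation data, and the hyperparameters are illustrative.

```python
import networkx as nx
from node2vec import Node2Vec

# Toy citation graph: nodes are papers, edges are citation relationships.
g = nx.Graph([("paper_a", "paper_b"), ("paper_b", "paper_c"),
              ("paper_a", "paper_c")])

n2v = Node2Vec(g, dimensions=64, walk_length=10, num_walks=50, workers=2)
model = n2v.fit(window=5, min_count=1)  # trains word2vec over the random walks

# Nearest neighbours in the embedding space approximate "similar papers".
print(model.wv.most_similar("paper_a"))
```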
Transformer Mods for Document Length Inputs (Sujit Pal)
The Transformer architecture is responsible for many state of the art results in Natural Language Processing. A central feature behind its superior performance over Recurrent Neural Networks is its multi-headed self-attention mechanism. However, the superior performance comes at a cost: an O(n²) time and memory complexity, where n is the size of the input sequence. Because of this, it is computationally infeasible to feed large documents to the standard transformer. To overcome this limitation, a number of approaches have been proposed, which involve modifying the self-attention mechanism in interesting ways.
In this presentation, I will describe the transformer architecture, and specifically the self-attention mechanism, and then describe some of the approaches proposed to address the O(n²) complexity. Some of these approaches have also been implemented in the HuggingFace transformers library, and I will demonstrate some code for doing document level operations using one of these approaches.
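One such approach available in the Hugging Face transformers library is Longformer, whose sparse sliding-window attention reduces the quadratic cost and accepts inputs up to 4096 tokens; a brief, self-contained example follows.

```python
import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained("allenai/longformer-base-4096")
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

long_document = " ".join(["word"] * 3000)  # placeholder document-length input
inputs = tokenizer(long_document, return_tensors="pt",
                   truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```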
Question Answering as Search - the Anserini Pipeline and Other Stories (Sujit Pal)
In the last couple of years, we have seen enormous breakthroughs in automated Open Domain Restricted Context Question Answering, also known as Reading Comprehension, where the task is to find an answer to a question from a single document or paragraph. A potentially more useful task is to find an answer for a question from a corpus representing an entire body of knowledge, also known as Open Domain Open Context Question Answering.
To do this, we adapted the BERTSerini architecture (Yang, et al., 2019), using it to answer questions about clinical content from our corpus of 5000+ medical textbooks. The BERTSerini pipeline consists of two components -- a BERT model fine-tuned for Question Answering, and an Anserini (Yang, Fang, and Lin, 2017) IR pipeline for Passage Retrieval. Anserini, in turn, consists of pluggable components for different kinds of query expansion and result reranking. Given a question, Anserini retrieves candidate passages, which the BERT model uses to retrieve the answer from. The best answer is determined using a combination of passage retrieval and answer scores.
We evaluated this system using a locally developed dataset of medical passages, questions, and answers, adapting the BERT Question Answering component to our content through a combination of fine-tuning with third party SQuAD data and pre-training the model on our medical content. However, when we replaced the canned passages with passages retrieved using the Anserini pipeline, performance dropped significantly, indicating that the relevance of the retrieved passages was a limiting factor.
The presentation will describe the actions taken to improve the relevance of passages returned by the Anserini pipeline.
Building Named Entity Recognition Models Efficiently using NERDS (Sujit Pal)
Named Entity Recognition (NER) is foundational for many downstream NLP tasks such as Information Retrieval, Relation Extraction, Question Answering, and Knowledge Base Construction. While many high-quality pre-trained NER models exist, they usually cover a small subset of popular entities such as people, organizations, and locations. But what if we need to recognize domain specific entities such as proteins, chemical names, diseases, etc? The Open Source Named Entity Recognition for Data Scientists (NERDS) toolkit, from the Elsevier Data Science team, was built to address this need.
NERDS aims to speed up development and evaluation of NER models by providing a set of NER algorithms that are callable through the familiar scikit-learn style API. The uniform interface allows reuse of code for data ingestion and evaluation, resulting in cleaner and more maintainable NER pipelines. In addition, customizing NERDS by adding new and more advanced NER models is also very easy, just a matter of implementing a standard NER Model class.
Our presentation will describe the main features of NERDS, then walk through a demonstration of developing and evaluating NER models that recognize biomedical entities. We will then describe a Neural Network based NER algorithm (a Bi-LSTM seq2seq model written in PyTorch) and integrate it into the NERDS NER pipeline.
We believe NERDS addresses a real need for building domain specific NER models quickly and efficiently. NER is an active field of research, and the hope is that this presentation will spark interest and contributions of new NER algorithms and Data Adapters from the community that can in turn help to move the field forward.
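As a purely hypothetical sketch of the fit/predict contract the abstract describes (the class and method names below are illustrative, not the actual NERDS API), the value of the uniform interface is that data ingestion and evaluation code can be shared across models.

```python
from sklearn.metrics import classification_report

class GazetteerNER:
    """Toy dictionary tagger implementing the shared fit/predict contract."""

    def fit(self, token_seqs, label_seqs):
        # Remember every labelled (non-"O") token seen during training.
        self.lexicon = {t: l for toks, labs in zip(token_seqs, label_seqs)
                        for t, l in zip(toks, labs) if l != "O"}
        return self

    def predict(self, token_seqs):
        return [[self.lexicon.get(t, "O") for t in toks] for toks in token_seqs]

train_X = [["aspirin", "treats", "headache"]]
train_y = [["B-CHEM", "O", "B-DISEASE"]]
model = GazetteerNER().fit(train_X, train_y)
pred = model.predict([["aspirin", "prevents", "headache"]])
# The same flatten-and-score evaluation works for any model with this API.
print(classification_report(sum(train_y, []), sum(pred, [])))
```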
Learning to Rank Presentation (v2) at LexisNexis Search Guild (Sujit Pal)
An introduction to Learning to Rank, with case studies using RankLib with and without plugins provided by Solr and Elasticsearch. RankLib is a library of learning to rank algorithms, which includes some popular LTR algorithms such as LambdaMART, RankBoost, RankNet, etc.
Learning to Rank (LTR) presentation at RELX Search Summit 2018. Contains information about history of LTR, taxonomy of LTR algorithms, popular algorithms, and case studies of applying LTR using the TMDB dataset using Solr, Elasticsearch and without index support.
Search summit-2018-content-engineering-slides (Sujit Pal)
Slides accompanying content engineering tutorial presented at RELX Search Summit 2018. Contains techniques for keyword extraction using various statistical, rule based and machine learning methods, keyword de-duplication using SimHash and Dedupe, and dimensionality reduction techniques such as Topic Modeling, NMF, Word vectors, etc.
SoDA v2 - Named Entity Recognition from streaming text (Sujit Pal)
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
Evolving a Medical Image Similarity Search (Sujit Pal)
Slides for talk at Haystack Conference 2018. Covers evolution of an Image Similarity Search Proof of Concept built to identify similar medical images. Discusses various image vectorizing techniques that were considered in order to convert images into searchable entities, an evaluation strategy to rank these techniques, as well as various indexing strategies to allow searching for similar images at scale.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also ran a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, the aspects they look for in a new TV, and their TV buying preferences.
2. #Graphorum
Who am I?
• (Mostly self-taught) data scientist
• Work at Elsevier Labs
• Worked with Deep Learning, Machine Learning, Natural Language Processing, Search, Backend Web Development, Database Administration, and Unix System Administration in reverse chronological order.
• Took Graph Theory in college
• Rekindled interest after Social Network Analysis course on Coursera
• Interested in applications of Graph techniques to NLP
4. #Graphorum
Typical NLP + Graph problems
• Represent text units as nodes and (similarity based) relationships as edges in graph
• Leverage intrinsic or extrinsic graphical structure of data
• Intrinsic – co-citations and co-mentions in academic graph
• Extrinsic – text data from social networks
• Leverage external graph structure such as Knowledge Graph to improve results for NLP task
5. #Graphorum
Case Studies
• Summarization using network metrics
• Document Clustering using Random Walk
• Word Sense Disambiguation using Label Propagation
• Incorporating external knowledge for Topic Identification
6. #Graphorum
Matrices and Graphs are Interchangeable
• Text elements => vectors
• Collection of elements => matrix
• Similarity = operation on pairwise rows of matrix
• Convert to graph
• Graph Methods!
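To make the matrix-to-graph round trip concrete, here is a minimal Python sketch (illustrative, not from the talk; the toy corpus and the 0.1 edge threshold are assumptions) using scikit-learn and networkx:

import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["graphs model relationships between vertices",
        "languages model relationships between words",
        "cats purr when they are happy"]
X = TfidfVectorizer().fit_transform(docs)   # collection of elements => matrix
S = cosine_similarity(X)                    # pairwise row similarity

G = nx.Graph()
G.add_nodes_from(range(len(docs)))
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        if S[i, j] > 0.1:                   # keep only meaningful edges
            G.add_edge(i, j, weight=S[i, j])
# From here, graph methods (centrality, communities, ...) apply.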
8. #Graphorum
Case Study #1: Steps
• Create graph – sentences are nodes, edges connect sentences that share common meaningful nouns
• Develop 14 summarizers (CN-SUMM) based on various graph metrics, each summarizer produces a ranked list of sentences
• Voting based ensemble (CN-VOTING) ranks sentences with sum of rankings from each of the 14 summarizers
• Return top ranked sentences from CN-VOTING as summary
9. #Graphorum
Case Study #1: Implementation
• Extract common nouns from sentences and compute similarity as overlap
• Construct graph of sentences
• Compute Degree, Strength, Closeness, and PageRank centrality scores per node, Shortest Path from each node to every other node, D-Ring, k-Core, and w-Cuts, determine K most central nodes by each measure
• Ensemble predictions using Voting to produce summary sentences
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/01-doc-summarization
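As a rough illustration of the voting ensemble (a sketch under assumptions, not the repo's code; only three of the 14 summarizers are shown, and the sentence graph G and summary length K are assumed to come from the steps above):

import networkx as nx

def ranked(scores):
    # sentence ids, most central first
    return sorted(scores, key=scores.get, reverse=True)

summarizers = [
    nx.degree_centrality(G),
    nx.closeness_centrality(G),
    nx.pagerank(G),
]

# CN-VOTING: sum each sentence's rank position across summarizers;
# a lower total means the sentence is more central overall.
votes = {n: 0 for n in G.nodes()}
for scores in summarizers:
    for rank, node in enumerate(ranked(scores)):
        votes[node] += rank

K = 5  # assumed summary length
summary_sentence_ids = sorted(votes, key=votes.get)[:K]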
10. #Graphorum
Case Study #1: Degree and Strength
• Degree – number of edges incident on a vertex, measured by Degree Centrality
• Strength – sum of edge weights incident on the vertex, measured by Weighted Degree Centrality
11. #Graphorum
Case Study #1: Closeness
• Closeness Centrality measures how efficiently a vertex is able to spread information across the network
• Defined as the inverse of the average distance (“farness”) to all other nodes
12. #Graphorum
Case Study #1: PageRank
• Popularized by Google’s Brin and Page
• Quality and number of in-links to a page is rough estimate of page quality
• Iterative procedure, until convergence
• Starts with all nodes having same rank
• “Surfer” starts on random page
• Chooses a page randomly from among its outlinks
• With probability d (d=0.15 for web) jumps to some random page on web
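A toy power-iteration sketch of this procedure (illustrative; follows the slide's convention where d = 0.15 is the probability of jumping to a random page):

import numpy as np

A = np.array([[0, 1, 1],   # adjacency matrix: row i links to column j
              [1, 0, 0],
              [1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
d, n = 0.15, A.shape[0]

r = np.full(n, 1.0 / n)                # all nodes start with the same rank
for _ in range(100):                   # iterate until convergence
    r_next = d / n + (1 - d) * (r @ P)
    if np.abs(r_next - r).sum() < 1e-9:
        break
    r = r_next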
13. #Graphorum
Case Study #1: Shortest Paths
• Mean shortest path from each node to every other node
• Compute all-pairs shortest paths
• Algorithm uses linear number of matrix multiplications
• Order is O(V⁴)
• Introduced by Shimbel (1953)
• Compute mean shortest path from each node to all other nodes
• An indirect measure of centrality
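For modest graph sizes this centrality measure is a few lines with networkx (a sketch; networkx computes all-pairs shortest paths via BFS rather than the matrix-multiplication formulation above, and the karate club graph is a stand-in):

import networkx as nx

G = nx.karate_club_graph()   # stand-in graph
mean_dist = {}
for src, dists in nx.all_pairs_shortest_path_length(G):
    mean_dist[src] = sum(dists.values()) / (len(G) - 1)
# A lower mean distance to all other nodes indicates a more central node.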
14. #Graphorum
Case Study #1: D-Ring
• Create subgraphs by dilating
• Start with highest degree (or position)
• D-ring is difference of subgraphs created by consecutive dilations
• Continue to add D-rings until enough nodes available for summary
• Two measures of centrality – CN-RingL and CN-RingK
15. #Graphorum
Case Study #1: K-Core
• Create subgraph starting with node with highest degree (or position) k
• Relax threshold for k and continue adding nodes until there are enough nodes for the summary
• Two measures of centrality based on degree or position – CN-CoreK and CN-CoreL
16. #Graphorum
Case Study #1: W-Cuts
• Create subgraph starting with node pair with highest edge weight
• Relax edge weight threshold and continue adding nodes until enough nodes available for summary
• Two measures of centrality – CN-CutK and CN-CutL, based on preference given to position or degree
18. #Graphorum
Case Study #1: Closing Thoughts
• Generated summaries are good, but biased towards longer sentences
• Strategy described above can be extended to multi-document summarization as well, e.g., summary of product reviews.
• A variant of the strategy described is used in the gensim summarizer.
21. #Graphorum
Case Study #2: Steps
• Paper asserts that Language Model based graph is more effective for clustering than TD (term-document) matrix based graph
• Represent each document in corpus as a node, edges connect documents by cosine similarity of TF-IDF document vectors
• Compute t-step (t=1, 2, 3) random walks for each node, considering only top k edges (for k ~ 80), and compute generation probabilities
• Cluster resulting graph of document generation probabilities with k-means and Louvain Modularity
22. #Graphorum
Case Study #2: Implementation
• 20-Newsgroups dataset (18k newsgroup postings, 20 categories)
• Clean text and construct TD matrix
• Construct cosine similarity matrix S, sparsify using top generators (c=80), remove self-edges, and renormalize
• Run (c=80) random walks on each node for path length = 1, 2, 3
• Compute empirical transition probability matrix (language model!) G from walks
• Construct graph, apply Louvain Community Detection on various graphs
• Compare against K-Means clusters from various document vectors
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/02-docs-clustering
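The random-walk and community-detection steps might look like the following sketch (illustrative, not the repo's code; it assumes G is the sparsified cosine-similarity graph from above, and uses networkx's Louvain implementation, available in networkx 3.0+, in place of Neo4j's):

import random
from collections import Counter
import networkx as nx

def walk_counts(G, walks_per_node=80, length=3):
    # Empirical visit counts from short random walks -- the "language
    # model" over documents.
    counts = {n: Counter() for n in G.nodes()}
    for start in G.nodes():
        for _ in range(walks_per_node):
            node = start
            for _ in range(length):
                nbrs = list(G.neighbors(node))
                if not nbrs:
                    break
                node = random.choice(nbrs)
                counts[start][node] += 1
    return counts

# Edge weights in the new graph are empirical generation probabilities.
H = nx.Graph()
for src, ctr in walk_counts(G).items():
    total = sum(ctr.values())
    for dst, c in ctr.items():
        w = c / total
        if H.has_edge(src, dst):
            w = (w + H[src][dst]["weight"]) / 2   # symmetrize the two directions
        H.add_edge(src, dst, weight=w)

clusters = nx.community.louvain_communities(H, weight="weight")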
23. #Graphorum
Case Study #2: TD Matrix to Cosine Similarity
Image Credit: https://www.quora.com/What-is-a-tf-idf-vector
• Documents represented as TD Matrix (n documents x t features)
• Similarity Matrix (n x n) = TD Matrix (n x t) times its transpose (t x n), divided by |T| to keep similarity values in range (0, 1)
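In numpy terms (a minimal sketch of the same computation; here the row normalization plays the role of the |T| division, making the matrix product exactly a cosine similarity):

import numpy as np

TD = np.random.rand(5, 10)                          # n docs x t features
T = TD / np.linalg.norm(TD, axis=1, keepdims=True)  # normalize rows
S = T @ T.T                                         # n x n similarities in (0, 1)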
24. #Graphorum
Case Study #2: Random Walks
Image Credit: https://snap.stanford.edu/node2vec/
• Probabilistic technique used to “flatten” graph into feature vector
• Intuition – similar nodes are closer to each other in the graph than dissimilar nodes
• Compute empirical generation probabilities
• Other popular applications – DeepWalk and node2vec
25. #Graphorum
Case Study #2: Louvain Modularity
• Community Detection Algorithm – maximize modularity score for each community
• Modularity = difference between actual and expected number of edges between each node pair, summed over all node pairs in the community
• Iterative procedure, run till convergence
• Greedily assign nodes to communities, optimizing local modularity
• Define a coarse grained network of communities
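For reference, the modularity score being maximized is conventionally written as

Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)

where m is the number of edges, A_{ij} the adjacency matrix, k_i the degree of node i, and \delta(c_i, c_j) = 1 exactly when nodes i and j are in the same community.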
26. #Graphorum
Case Study #2: Results
• Silhouette Score = tightness / separation
• Baseline – TD + K-means + Labels close to 0
• G1, G2, G3 – LM Matrices for n=1, n={1,2}, and n={1,2,3}
• LM based graphs outperform TD matrix based graphs
• Louvain outperforms K-Means
27. #Graphorum
Case Study #2: Closing Thoughts
• Transforming the graph to have edges based on transition probabilities from random walks yields better clustering results
• Random Walks on graph structures are often used to “flatten” the graph and expose higher-order proximity dependencies that can sometimes look like semantic similarity
• Community Detection algorithms can be used for clustering, and often produce more explainable clusters
30. #Graphorum
Case Study #3: Steps
• Choose ambiguous word of interest (https://muse.dillfrog.com/lists/ambiguous)
• Find sentences containing ambiguous word from large corpus
• Manually assign labels to some sentences
• Featurize each sentence using POS of neighboring words, unigrams, and local collocations
• Create graph with sentences as nodes, edges weighted by cosine similarity and JS divergence of feature vectors
• Propagate Labels till convergence
• Generate word sense clusters
31. #Graphorum
Case Study #3: Implementation
• We selected the ambiguous word “compound” with these 2 senses
• Chemical compound
• Composite or multiple
• Extracted 670 sentences containing “compound” from SD corpus
• Manually marked up 40 total sentences (19 + 21), ~ 5% of corpus
• Created TD matrix of 1..3 grams + 3-gram POS tags, sparsified (k=5), removed self-edges, and created graph
• Ran Label Propagation to propagate the 40 labels to unlabeled sentences
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/03-word-sense-disambiguation
32. #Graphorum
Case Study #3: Label Propagation
• Label Propagation uses network structure to detect communities
• Used here in semi-supervised manner by specifying labels for a small subset of nodes
• Iterative algorithm
• Initialize nodes each with unique label
• Each node updates its label to the most frequent label of its neighbors
• Converges when each node has the most frequent label of its neighbors
• Not guaranteed to converge!
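An illustrative semi-supervised sketch with scikit-learn's LabelPropagation over sentence feature vectors (an assumption-laden stand-in for the Neo4j pipeline used in the talk; the random features are placeholders, and the label counts simply mirror the numbers on the previous slide):

import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(42)
X = rng.random((670, 50))      # stand-in sentence feature vectors
y = np.full(670, -1)           # -1 marks unlabeled sentences
y[:19], y[19:40] = 0, 1        # the 40 manually labeled seeds

model = LabelPropagation(kernel="knn", n_neighbors=5)
model.fit(X, y)
senses = model.transduction_   # predicted sense for every sentence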
33. #Graphorum
Case Study #3: Results
• Of 623 unlabeled sentences, Label Propagation predicts 319 sentences use the first sense (chemical compound), 7 use the second sense (composite), and misses 298
• Misses are mostly chemical compounds (sense 1)
• Examples:
• Sense #1: ORTEP view of the compound [CuL8(ClO4)2] with the numbering scheme adopted.
• Sense #2: Sensitive to compound fluorescence.
• Results can probably be improved – we tried increasing the number of initial labels, and starting with denser networks (so LP does not terminate as quickly)
36. #Graphorum
Case Study #4: Steps
• Build up in-memory graph structure for Knowledge Graph (KG)
• Match phrases in document against KG entries
• Compute Personalized PageRank (PPR) biased to matched nodes
• Roll up top scored concepts from PPR to category concepts
• Report top category concepts as document topics
37. #Graphorum
Case Study #4: Implementation
• Annotate ScienceDaily article against Aho-Corasick dictionary of KG concepts
• Using company proprietary KG to build graph, 2 versions
• Lateral relations only
• isChildOf (child -> parent) relations only
• Run Personalized PageRank (PPR) against Lateral Relations graph, setting source nodes to concepts found in article
• Roll up high PPR score concepts to disease category concepts
• Top disease category concepts are document topic labels
• See https://github.com/sujitpal/nlp-graph-examples/tree/master/04-topic-identification
38. #Graphorum
Case Study #4: Aho-Corasick Matching
• Inverted index of terms to concept ID stored in trie-like data structure, where every node is a token in phrase representing concept name
• Document streamed against this data structure to produce list of phrases in document matched against concepts in dictionary
Image Credit: https://brunorb.com/aho-corasick/
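A dictionary-matching sketch with the pyahocorasick library (illustrative; the concept names and ids are made up, and the talk's actual pipeline may differ):

import ahocorasick

concepts = {"heart attack": "C001", "aspirin": "C002"}  # phrase -> concept id
A = ahocorasick.Automaton()
for phrase, concept_id in concepts.items():
    A.add_word(phrase, (concept_id, phrase))
A.make_automaton()

doc = "aspirin is often given after a heart attack"
matches = [value for _end, value in A.iter(doc)]
# -> [('C002', 'aspirin'), ('C001', 'heart attack')]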
39. #Graphorum
Case Study #4: Personalized PageRank
• In PageRank, surfer doing random walk on graph jumps to some random point in the graph with some probability d (d=0.15 for web graphs)
• In Personalized PageRank (PPR), surfer will jump to a neighborhood of the graph specified by a set of nodes (source nodes)
• Overall effect is to assign high PPR to nodes that are in close proximity to the source nodes
• Personalized PageRank has been found to be an effective measure for recommendation systems
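With networkx (a stand-in for the Neo4j implementation the talk used), Personalized PageRank is the regular pagerank call with a personalization vector concentrated on the source nodes; here G is assumed to be the lateral-relations KG graph and `matched` the concept nodes found by the dictionary-matching step:

import networkx as nx

personalization = {n: (1.0 if n in matched else 0.0) for n in G.nodes()}
ppr = nx.pagerank(G, alpha=0.85, personalization=personalization)
top_concepts = sorted(ppr, key=ppr.get, reverse=True)[:20]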
40. #Graphorum
Case Study #4: Disease Categories
• Disease Category Concepts are children of Diseases Concept
• Navigate to parent from Discovered Concepts until a Disease Category node is found (or no parents are found)
• Roll up discovered concepts to their Disease Categories – these are the Document Topics
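The roll-up step is a walk up the isChildOf edges (a sketch; K is assumed to be a directed networkx graph with child -> parent edges, disease_categories a set of category node ids, and top_concepts the output of the PPR step above):

def roll_up(K, concept, disease_categories):
    # Follow child -> parent edges until a Disease Category node is
    # found, or there are no more parents.
    seen, frontier = set(), [concept]
    while frontier:
        node = frontier.pop()
        if node in disease_categories:
            return node
        if node not in seen:
            seen.add(node)
            frontier.extend(K.successors(node))
    return None

topics = {roll_up(K, c, disease_categories) for c in top_concepts}
topics.discard(None)   # concepts with no Disease Category ancestor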
42. #Graphorum
Case Study #4: Closing Thoughts
• Topic predictions from rolling up high PPR concepts are serendipitous, but not necessarily complete
• Better results if combined with topic predictions obtained from rolling up concepts found in article
44. #Graphorum
Summing up
• Content features and graph structure often reinforce each other
• Can be useful for unsupervised and semi-supervised NLP tasks
• Not necessarily an either-or – BERT based models can coexist with Graph techniques
45. #Graphorum
Tools
• Originally planned to use Spark + GraphFrames for large graphs and Neo4j for small / medium graphs
• Neo4j worked well for largest graph (500 K nodes, 1.3 M edges)
• Neo4j algorithms frequently have more functionality
• Allows multiple source nodes for Personalized PageRank
• Allows weighted edges in Label Propagation
• Ended up using Neo4j for all case studies