This document discusses approaches for visualizing comparisons of word embeddings. It begins by describing the challenges of interpreting word embeddings, which reside in high-dimensional spaces with complex relationships between words. It then reviews common tasks for analyzing word embeddings, such as identifying similar words and semantic directions. Existing visualization techniques are limited in supporting these tasks. The document proposes three new visualization designs: Buddy Plots to show local word similarities, Concept Axis Plots to reveal semantic directions between words, and Co-occurrence Matrices to display predictive relationships. These designs aim to help interpret word embeddings and guide linguistic analysis.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
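The two-subnetwork idea can be sketched in a much-simplified form. The vocabulary, the random embeddings, and the fixed mixing weights below are illustrative stand-ins: the actual duet model learns both subnetworks jointly (the local one over a query-term by document-position interaction matrix), whereas this sketch only combines a hand-rolled exact-match score with a cosine over averaged embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and (hypothetical) pretrained word embeddings.
vocab = {"cheap": 0, "flights": 1, "london": 2, "hotels": 3, "paris": 4}
emb = rng.normal(size=(len(vocab), 8))  # distributed representations

def local_score(query, doc):
    """Local-representation stand-in: exact-match evidence.

    The real duet model feeds a binary query-term x doc-position
    interaction matrix into a neural network; here we just compute
    the fraction of query terms that appear verbatim in the document.
    """
    return sum(term in doc for term in query) / len(query)

def distributed_score(query, doc):
    """Distributed-representation stand-in: cosine of mean embeddings."""
    q = emb[[vocab[t] for t in query]].mean(axis=0)
    d = emb[[vocab[t] for t in doc]].mean(axis=0)
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def duet_score(query, doc, w_local=0.5, w_dist=0.5):
    # In the paper the two subnetwork scores are summed and both parts
    # are trained jointly; the weights here are illustrative constants.
    return w_local * local_score(query, doc) + w_dist * distributed_score(query, doc)

query = ["cheap", "flights"]
doc = ["cheap", "flights", "london"]
print(duet_score(query, doc))
```

The point of the combination is visible even in this toy form: a document can score well through exact term overlap, through latent similarity, or both.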
5 Lessons Learned from Designing Neural Models for Information Retrieval (Bhaskar Mitra)
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles is emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate the application of neural approaches to IR that go beyond text matching.
Evaluating Topic Coherence Using Distributional Semantics (pathsproject)
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Using Distributional Semantics", Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), Long Papers, Potsdam, Germany
Neural Models for Information Retrieval (Bhaskar Mitra)
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks, leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
Vectorland: Brief Notes from Using Text Embeddings for Search (Bhaskar Mitra)
(Invited talk at Search Solutions 2015)
A lot of recent work in neural models and “Deep Learning” is focused on learning vector representations for text, image, speech, entities, and other nuggets of information. From word analogies to automatically generating human level descriptions of images, the use of text embeddings has become a key ingredient in many natural language processing (NLP) and information retrieval (IR) tasks.
In this talk, I will present some personal learnings from working on (neural and non-neural) text embeddings for IR, as well as highlight a few key recent insights from the broader academic community. I will talk about the affinity of certain embeddings for certain kinds of tasks, and how the notion of relatedness in an embedding space depends on how the vector representations are trained. The goal of this talk is to encourage everyone to start thinking about text embeddings beyond just as an output of a “black box” machine learning model, and to highlight that the relationships between different embedding spaces are about as interesting as the relationships between items within an embedding space.
Neural Models for Information Retrieval (Bhaskar Mitra)
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks, leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embeddings spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
As the volume of content continues to grow exponentially, helping search engines understand the context and topical themes within your site is increasingly important. This session covers some of these concepts, as well as ways to utilise them in your marketing strategy.
Towards a Distributional Semantic Web Stack (Andre Freitas)
The ability of distributional semantic models (DSMs) to discover similarities over large-scale, heterogeneous, and poorly structured data makes them a promising universal and low-effort framework for supporting semantic approximation and knowledge discovery. This position paper explores the role of distributional semantics in the Semantic Web vision, based on state-of-the-art distributional-relational models, categorizing and generalizing existing approaches into a Distributional Semantic Web stack.
Continuous bag of words (cbow) word2vec word embedding work.pdf (devangmittal4)
The way the continuous bag of words (CBOW) word2vec word embedding works is that it predicts the probability of a word given a context. A context may be a single word or a group of words, but for simplicity, I will take a single context word and try to predict a single target word.
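A minimal from-scratch sketch of this single-context-word CBOW step: the toy vocabulary, the embedding dimension, and the plain gradient-descent loop are all illustrative assumptions, not the word2vec implementation (which adds tricks like negative sampling and hierarchical softmax).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and dimensions (assumptions for illustration).
vocab = ["word", "is", "characterized", "by", "the", "company", "it", "keeps"]
V, N = len(vocab), 5              # vocabulary size, embedding dimension
W_in = rng.normal(scale=0.1, size=(V, N))    # input (embedding) matrix
W_out = rng.normal(scale=0.1, size=(N, V))   # output projection matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(context_idx):
    """P(target word | one context word): softmax over the vocabulary."""
    h = W_in[context_idx]         # hidden layer = the context word's embedding
    return softmax(h @ W_out)

def train_step(context_idx, target_idx, lr=0.1):
    """One gradient-descent step on the cross-entropy loss."""
    global W_in, W_out
    h = W_in[context_idx]
    err = predict(context_idx)
    err[target_idx] -= 1.0        # gradient of the loss w.r.t. the logits
    grad_out = np.outer(h, err)
    grad_in = W_out @ err
    W_out -= lr * grad_out
    W_in[context_idx] -= lr * grad_in

# Learn to predict the target "keeps" from the single context word "it".
ctx, tgt = vocab.index("it"), vocab.index("keeps")
for _ in range(200):
    train_step(ctx, tgt)
print(predict(ctx)[tgt])          # probability of the target after training
```

After training, the rows of `W_in` are the learned word embeddings; with many (context, target) pairs from a real corpus, words appearing in similar contexts end up with similar rows.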
The purpose of this question is to be able to create a word embedding for the given data set.
data set text:
In linguistics, word embeddings were discussed in the research area of distributional semantics, which aims to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth.

The technique of representing words as vectors has roots in the 1960s with the development of the vector space model for information retrieval. Reducing the number of dimensions using singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s. In 2000, Bengio et al. provided in a series of papers the "neural probabilistic language models" to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words" (Bengio et al., 2003). Word embeddings come in two different styles: one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of the linguistic contexts in which the words occur; these different styles are studied in (Lavelli et al., 2004). Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high-dimensional data structures. The area developed gradually and really took off after 2010, partly because important advances had been made since then on the quality of vectors and the training speed of the models.

There are many branches and many research groups working on word embeddings. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit which can train vector space models faster than previous approaches. Most new word embedding techniques rely on a neural network architecture instead of more traditional n-gram models and unsupervised learning.
Limitations
One of the main limitations of word embeddings (word vector space models in general) is that the possible meanings of a word are conflated into a single representation (a single vector in the semantic space). Sense embeddings are a solution to this problem: individual meanings of words are represented as distinct vectors in the space.
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad. The representation is named bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks (Leonardo Di Donato)
Experimental work done regarding the use of Topic Modeling for the implementation and the improvement of some common tasks of Information Retrieval and Word Sense Disambiguation.
It first describes the scenario, the pre-processing pipeline realized, and the framework used. We then discuss the investigation of several different hyperparameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents, mainly through two different approaches: inferring the topic distribution of a held-out document (or query) and comparing it against the collection to retrieve similar documents, or an approach driven by probabilistic querying. The last part of this work is devoted to the investigation of the word sense disambiguation task.
Semantic Based Model for Text Document Clustering with Idioms (Waqas Tariq)
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. A method has been developed that performs the following: tagging the documents for parsing, replacing idioms with their original meaning, calculating semantic weights for document words, and applying a semantic grammar. A similarity measure is computed between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters has been demonstrated.
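The similarity-then-hierarchical-clustering step can be sketched as follows. The plain bag-of-words counts here stand in for the paper's semantically weighted vectors, and average-link agglomerative clustering over cosine distances (via scipy) is one reasonable choice, not necessarily the authors' exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Toy bag-of-words document vectors (rows = documents, cols = terms);
# in the paper these would carry semantic weights computed after idiom
# replacement and parsing. Plain counts stand in here.
docs = np.array([
    [2, 1, 0, 0],   # doc A
    [1, 2, 0, 0],   # doc B (similar vocabulary to A)
    [0, 0, 2, 1],   # doc C
    [0, 0, 1, 2],   # doc D (similar vocabulary to C)
], dtype=float)

# Pairwise cosine distances, then agglomerative (hierarchical) clustering.
dist = pdist(docs, metric="cosine")          # condensed distance matrix
tree = linkage(dist, method="average")       # average-link dendrogram
labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 clusters
print(labels)
```

Cutting the dendrogram into two clusters groups A with B and C with D, since each pair shares vocabulary and therefore has low cosine distance.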
{Ontology: Resource} x {Matching : Mapping} x {Schema : Instance} :: Compone... (Amit Sheth)
Invited Talk, International Workshop on Ontology Matching
collocated with the 5th International Semantic Web Conference
ISWC-2006, November 5, 2006, Athens GA
Latent Semantic Analysis (LSA) is a mathematical technique for computationally modeling the meaning of words and larger units of text. LSA works by applying a technique called Singular Value Decomposition (SVD) to a term-by-document matrix containing frequency counts for all words found in all of the documents or passages in the corpus. After this SVD application, the meaning of a word is represented as a vector in a multidimensional semantic space, which makes it possible to compare word meanings, for instance by computing the cosine between two word vectors.
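A minimal numpy sketch of this pipeline, using a made-up term-by-document count matrix. Real LSA applications would use a large corpus and typically apply a weighting scheme (e.g. tf-idf or log-entropy) to the counts before the SVD.

```python
import numpy as np

# Toy term-by-document count matrix (rows = terms, cols = documents);
# the vocabulary and counts are invented for illustration.
terms = ["ship", "boat", "ocean", "tree", "forest"]
X = np.array([
    [2, 1, 0, 0],   # ship
    [1, 2, 0, 0],   # boat
    [1, 1, 1, 0],   # ocean
    [0, 0, 1, 2],   # tree
    [0, 0, 2, 1],   # forest
], dtype=float)

# LSA: truncated SVD of the term-document matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                          # number of latent dimensions to keep
word_vecs = U[:, :k] * s[:k]   # each row: a term's vector in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

i, j, m = terms.index("ship"), terms.index("boat"), terms.index("tree")
print(cosine(word_vecs[i], word_vecs[j]))  # terms from the same documents
print(cosine(word_vecs[i], word_vecs[m]))  # terms from different documents
```

Terms that co-occur in the same documents ("ship", "boat") end up close in the latent space, while terms from disjoint documents ("ship", "tree") do not, which is exactly the comparison-by-cosine described above.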
LSA has been successfully used in a large variety of language related applications from automatic grading of student essays to predicting click trails in website navigation. In Coh-Metrix (Graesser et al. 2004), a computational tool that produces indices of the linguistic and discourse representations of a text, LSA was used as a measure of text cohesion by assuming that cohesion increases as a function of higher cosine scores between adjacent sentences.
Besides being interesting as a technique for building programs that need to deal with semantics, LSA is also interesting as a model of human cognition: it can match human performance on word association tasks and vocabulary tests. In this talk, Fridolin will focus on LSA as a tool for modeling language acquisition. After framing the area of the talk by sketching the key concepts of learning, information, and competence acquisition, and after outlining presuppositions, an introduction to meaningful interaction analysis (MIA) is given. MIA is a means to inspect learning with the support of language analysis that is geometrical in nature; it is a fusion of latent semantic analysis (LSA) and network analysis (NA/SNA). LSA, NA/SNA, and MIA are illustrated with several examples.
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION (ijnlc)
Word Sense Disambiguation (WSD) is an important area which has an impact on improving the performance of applications of computational linguistics such as machine translation, information retrieval, text summarization, question answering systems, etc. We present a brief history of WSD and discuss the supervised, unsupervised, and knowledge-based approaches to it. Though many WSD algorithms exist, we consider optimal and portable WSD algorithms the most appropriate, since they can be embedded easily in applications of computational linguistics. This paper also surveys some WSD algorithms and their performance, comparing them and assessing the need for word sense disambiguation.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
Climate impact / sustainability of software testing discussed on the talk. ICT and testing must carry their part of global responsibility to help with the climat warming. We can minimize the carbon footprint but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be added with sustainability, and then measured continuously. Test environments can be used less, and in smaller scale and on demand. Test techniques can be used in optimizing or minimizing number of tests. Test automation can be used to speed up testing.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
2. Two talks (or five) in one!
How do we deal with
embeddings of text data?
Alexander, et al. Serendip: Topic Model-Driven
Visual Exploration of Text Corpora. VAST ‘14.
Alexander & Gleicher. Task-Driven Comparison
of Topic Models. VAST ‘15.
Heimerl & Gleicher. Interactive Analysis of
Word Vector Embeddings. EuroVis ’18.
Heimerl, et al. Interactive Visual Comparison of
Object Embeddings. (TVCG ‘21)
How do we use comparison as a
tool for hard data problems?
Gleicher. Considerations for Visualizing
Comparison. InfoVis 2017.
3. How do we deal with embeddings
of text data?
Alexander, et al. Serendip: Topic Model-
Driven Visual Exploration of Text Corpora.
VAST ‘14.
Alexander & Gleicher. Task-Driven Comparison
of Topic Models. VAST ‘15.
Heimerl & Gleicher. Interactive Analysis of
Word Vector Embeddings. EuroVis ’18.
Heimerl, et al. Interactive Visual Comparison of
Object Embeddings.
TVCG ‘21
Eric Alexander
Assistant Prof.
Carleton College
Florian Heimerl
Post-Doc
University of Wisconsin
4.
5. Michael Gleicher
University of Wisconsin
Visual Computing Group
Human Graphics Interaction
authoring pictures, videos, animations
Human Robot Interaction
robots!
Human Data Interaction
visualization, visual analytics, interactive learning
8. Two different stories…
Embeddings are a great way to
analyze text collections!
But they are hard to interpret!
Use comparison as a strategy
Comparison is a great way to
think about analysis problems!
But it’s abstract – need examples!
Use embeddings as a case study
10. What is an embedding?
General mathematics:
Place a smaller structure into a
larger structure
Computer science:
Place a discrete set of objects into
a vector space
Encode relationships between
objects
Paris (4,8,1,3,…), Atlanta (5,2,1,7,…), New York (4,8,1,3,…), London (5,2,1,7,…),
Boston (3,2,5,1,…), Tokyo (9,2,6,4,…), Beijing (7,3,2,7,…), San Jose (5,2,1,7,…),
Jakarta (3,2,5,1,…), Sydney (9,2,6,4,…), Munich (7,3,2,7,…)
High Dimensional Data
Objects have associated Vectors
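A minimal sketch of this idea, with invented toy vectors (the numbers are illustrative, not learned): each object maps to a vector, and relationships are read off with a similarity measure such as cosine.

```python
import math

# Toy embedding: each object maps to a vector. The numbers are invented
# for illustration; real embeddings are learned from data.
embedding = {
    "Paris":  [4.0, 8.0, 1.0, 3.0],
    "London": [5.0, 2.0, 1.0, 7.0],
    "Boston": [3.0, 2.0, 5.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity: compares vector directions, ignoring length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

sim = cosine(embedding["Paris"], embedding["London"])
```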
11. Kinds of relationships in embeddings
Distance
A is closer to B than to C
Linear Structure
A is to B as C is to D
Semantic Directions
A is more X than C
Relationships are interesting even if global positions are not
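The "A is to B as C is to D" linear structure can be sketched with difference vectors. The 2-D vectors below are invented so the classic analogy lines up exactly; real embeddings only satisfy this approximately.

```python
# Toy 2-D vectors, chosen so the analogy offsets match exactly.
vec = {"man": [1.0, 2.0], "woman": [1.0, 4.0],
       "king": [5.0, 2.0], "queen": [5.0, 4.0]}

def offset(a, b):
    """Difference vector pointing from word b to word a."""
    return [x - y for x, y in zip(vec[a], vec[b])]

# Linear structure: "man is to woman as king is to queen"
# means the two offsets are (approximately) the same vector.
analogy_holds = offset("woman", "man") == offset("queen", "king")
```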
12. Embeddings
We care about relationships not values
The coordinates (axes) may have no meaning
15. Embeddings for analyzing Text Corpora
Embed Documents
Similar documents are close
Estimate similar meaning by
word usage statistics in a corpus
Embed Words
Similar words are close
Estimate similar meaning by
word usage statistics in a corpus
16. Document Similarity
Documents that use similar words
(probably) have similar content
Idea from the 1960s
https://www.nytimes.com/2019/01/02/obituaries/karen-sparck-jones-overlooked.html
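The word-count idea can be sketched in a few lines: represent each document as a bag-of-words count vector and compare the vectors with cosine similarity. This is a toy illustration of the 1960s idea, not any particular system.

```python
from collections import Counter
import math

def bow_cosine(doc_a, doc_b):
    """Cosine similarity of raw word-count (bag-of-words) vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb)

# Documents that share words score high; disjoint vocabularies score zero.
same_topic = bow_cosine("the cat sat on the mat", "a cat on a mat")
diff_topic = bow_cosine("the cat sat on the mat", "stock prices fell today")
```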
17. Similarity based on word counts?
Need to deal with similar words (synonyms)
Need to reduce the number of words (too high dimensional)
Find groups of words
Pre-defined dictionaries (word types)
word usage analysis
Statistical groupings (words that tend to go together)
Topic Modeling
19. Word vector embeddings
Place words in a high-dimensional vector space
Words similar in meaning should be close in space
Infer similarity by distributional semantics:
similar context implies similar meaning
Construct embeddings by processing a corpus of text
my pet cat is brown
my pet dog is brown
my big car is brown
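Distributional semantics can be made concrete with the mini-corpus above: count each word's context words in a small window. "cat" and "dog" end up with identical context counts, so a distributional method would place them close together. This is a toy sketch, not how Word2Vec or GloVe is actually trained.

```python
from collections import Counter, defaultdict

corpus = ["my pet cat is brown",
          "my pet dog is brown",
          "my big car is brown"]

# Count context words within a +/-1 window around each target word.
contexts = defaultdict(Counter)
for sent in corpus:
    toks = sent.split()
    for i, w in enumerate(toks):
        for j in (i - 1, i + 1):
            if 0 <= j < len(toks):
                contexts[w][toks[j]] += 1

# 'cat' and 'dog' occur in identical contexts ('pet' ... 'is'),
# so similar context implies similar meaning places them together.
```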
20. Why use Word Vector Embeddings?
Learn about
Language or Corpora (Texts)
Find similar words/synonyms
Track changes of word usage
Exploring polysemy
Creating lexical resources
Evidence of bias
…
Natural Language Applications
Pre-Processing
Translation
Sentiment Analysis
Interpretation
…
21. Several ways to build word embeddings
Word2Vec
Skip-gram model
Neural embedding
GloVe
Co-occurrence model
Factor matrix by optimization
Several ways to build topic models
(document embeddings)
LDA (Latent Dirichlet Allocation)
Standard algorithm
Iterative and probabilistic
NMF (Non-Negative Matrix Factors)
Fast Algorithm
Less widely adopted
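To make the NMF route concrete, here is a tiny sketch of non-negative matrix factorization via Lee-Seung multiplicative updates. This is not the implementation used in the work discussed here; it just shows the shape of the computation (documents x terms ≈ documents x topics times topics x terms).

```python
import numpy as np

rng = np.random.default_rng(0)

def nmf(X, k, iters=500):
    """Tiny NMF sketch (Lee-Seung multiplicative updates): X ≈ W @ H,
    with W (docs x topics) and H (topics x words) kept non-negative."""
    n, m = X.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Toy document-term count matrix with two obvious word groups,
# so a rank-2 factorization can recover the "topics".
X = np.array([[3, 2, 0, 0],
              [6, 4, 0, 0],
              [0, 0, 3, 2],
              [0, 0, 6, 4]], dtype=float)
W, H = nmf(X, k=2)
```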
22. Why Interpret Embeddings?
Gain insight into a model
Did you build a good model?
Gain insight into the modeling process
Which methods are better?
Gain insight about the underlying data
What does the model tell us?
23. Challenges of Word Vector Embeddings
Challenges of Document Embeddings
Large numbers ofWords (or documents)
High-dimensional spaces
Complex relationships – meaningless positions (or questionable)
Complex processes for building embeddings
Complex downstream applications
No ground truth - subjective aspects
24. Interactive Analysis of Word
Vector Embeddings
Florian Heimerl and Michael Gleicher
Department of Computer Sciences
University of Wisconsin – Madison
EuroVis 2018
25. Summary:
Word vector embeddings offer unique challenges
Task analysis of needs; 3 designs for unmet needs
Buddy Plots
Concept Axis Plots
Co-occurrence Matrices
26. Task Analysis:
What do people do with Word Embeddings?
Literature survey
111 papers from diverse communities
Consider use cases in Linguistics, HCI, Digital Humanities, etc.
Augment this list by extrapolation
what tasks would users want – but aren’t doing yet
27. Use cases suggest tasks
Learn about
Language or Corpora (Texts)
Find similar words/synonyms
Track changes of word usage
Exploring polysemy
Creating lexical resources
Natural Language Applications
Pre-Processing
Translation
Sentiment Analysis
Interpretation
Evaluation
Intrinsic (good embedding?)
Extrinsic (applications success?)
Interpretation
Identify items of interest
Probe values of interest
28. LinguisticTasks and Characteristics
We identified 7 distinct linguistic tasks within the literature
and 4 characteristics pertinent to those tasks: similarity,
arithmetic structures, concept axes, and co-occurrences
30. Tasks and Characteristics vs. Visualizations
Rank word pairs: similarity
View neighbors: similarity
Select synonyms: similarity
Compare concepts: average, similarity
Find analogies: offset, similarity
Project on Concept: concept axis
Predict Contexts: co-occurrence probability
Smilkov, et al 2016
Liu, et al 2017
31. Tasks and Characteristics vs. Visualizations
Rank word pairs: similarity
View neighbors: similarity
Select synonyms: similarity
Compare concepts: average, similarity
Find analogies: offset, similarity
Project on Concept: concept axis
Predict Contexts: co-occurrence probability
Smilkov, et al 2016
Liu, et al 2017
Tasks we seek designs for in this paper
36. Ways to define axes
Vector between two concepts
Interaxis - Kim et al., 2015
Classifier between two groups
Explainers – Gleicher, 2013
(Example: Dog, Cat, Goldfish, Chicken, Cow, and Trout placed along a Pet-to-Food axis)
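The "vector between two concepts" (Interaxis-style) approach can be sketched as a scalar projection: take the difference vector between two concept words and project every other word onto it. The 2-D vectors below are invented for illustration.

```python
# Hypothetical 2-D word vectors; the axis is the vector from "pet" to
# "food" (difference of two concept vectors, as in Interaxis).
vec = {"pet": [0.0, 0.0], "food": [4.0, 0.0],
       "dog": [0.5, 1.0], "cat": [1.0, 1.0],
       "chicken": [3.0, 1.0], "trout": [3.5, 1.0]}

def project(word, a="pet", b="food"):
    """Scalar position of `word` along the a-to-b concept axis
    (0 at concept a, 1 at concept b)."""
    ax = [q - p for p, q in zip(vec[a], vec[b])]
    w = [q - p for p, q in zip(vec[a], vec[word])]
    return sum(x * y for x, y in zip(w, ax)) / sum(x * x for x in ax)
```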
40. Similarities:
Understanding local distances
Distances are meaningful
even if absolute values are not
What is close to a word?
Are there groups of words that are similar?
Ordered lists are useful
Density (how many can you show)
Sense of relative distances
Comparison between words
Embedding Projector
Smilkov, et al. 2016
41. Buddy Plots (1D lists)
Alexander and Gleicher, 2016 – for Topic Models
Map distance (to selected reference) to horizontal axis
Reference object
Closest to ref
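The core of a buddy plot, ordering words by distance to a selected reference, can be sketched in a few lines (toy vectors, invented for illustration; the real design also encodes relative distance along the axis).

```python
import math

# Hypothetical word vectors for illustration.
vec = {"cat": [1.0, 1.0], "dog": [1.2, 0.9],
       "kitten": [1.1, 1.3], "car": [5.0, 0.2]}

def buddies(ref):
    """Order words by distance to the reference word -- the 1-D
    ordering a buddy plot maps onto its horizontal axis."""
    r = vec[ref]
    dists = {w: math.dist(r, v) for w, v in vec.items() if w != ref}
    return sorted(dists, key=dists.get)
```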
49. Comparison?
For interpretation
Do the differences show
something?
Word meaning change
(between corpora)
Correlations / Biases
(within corpora)
For modeling (selection)
How are these models different?
Which is better?
50. Comparison?
Comparison is an important task
in Data Analysis
Comparison is special!
(since it involves multiple things)
It’s an important special case
It deserves special attention
Almost all Data Analysis can be
viewed as comparison
Comparison is a lens
(to look at problems)
It’s a generally useful tool
It deserves special attention
51. How do I think about
comparison?
to help me develop tools to help people do it
55. What is this paper?
Considerations for Visualizing Comparison
4 questions to ask when designing a
visualization or tool?
56. To examine (two or more
objects, ideas, people, etc.) in
order to note similarities and
differences
To mark or point out the
similarities and differences of
(two or more things)
57. What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
58. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
59. Wrong Question:
Is my problem Comparison?
Just about anything can be viewed as comparison
Not everything benefits from being viewed this way
60. Serendip, VAST ‘14
Alexander, E., Kohlmann, J., Valenza, R., Witmore, M., &
Gleicher, M. (2014). Serendip: Topic model-driven visual
exploration of text corpora. In 2014 IEEE Conference on
Visual Analytics Science and Technology (VAST)
61. Serendip: Topic model-driven visual
exploration of text corpora
Use a topic model to guide exploration of a text corpus
Find patterns and connect back to specifics
Documents Model Documents/Passages
63. Goals
Support inquiry across levels of abstraction
Combat issues of scale in the data
Many documents Long documents
64. Goals
Support inquiry across levels of abstraction
Combat issues of scale
Promote serendipitous discovery:
Multiple entry points
Highlight adjacency
Flexible exploration
A. Thudt, U. Hinrichs, S. Carpendale 2012
65. Basics of Serendip
Three different (interlinked) views
Corpus Document Passage Word
CorpusViewer TextViewer RankViewer
66.
67. Is this comparison?
No!
A tool for exploring a topic
model!
We didn’t describe it as comparison
Tool for looking at one topic model
Unclear how users think about it
Yes!
Comparison thinking really
helped!
We did think about comparison
Tool for using topic models
Our users had comparison tasks
68. Is this comparison? I don’t care!
No!
A tool for exploring a topic model!
We didn’t describe it as comparison
Tool for looking at one topic model
Unclear how users think about it
Yes!
Comparison thinking really helped!
We did think about comparison
Tool for using topic models
Our users had comparison tasks
A survey of comparison
would have missed this.
It’s a great example of
comparison ideas
69. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
70. Question 1:
What is the comparison?
What are the elements of the comparison?
71. The Elements of a Comparison . . .
Targets — Set of things being compared
Action — What to do with the relationship among them
To mark or point out the
similarities and differences of
(two or more things)
To examine
(two or more objects, ideas, etc.)
in order to note
similarities and differences
72. Question 1A:
The Elements: Targets
Do you know what you are comparing?
Explicit Comparisons – the system has the set of targets
Implicit Comparisons – the system may not know all the targets
compare against an implicit baseline
compare against the user’s knowledge
compare with targets only the user knows
73. Question 1A:
The Elements: Targets
What is being compared? – Comparison Targets
Does the model match my expectations?
What documents are similar?
How do groups of documents differ?
What words indicate these differences?
How are words used differently?
Where in texts are these differences?
Do the patterns match other things I know?
74. Question 1B:
The Elements: Actions
Verbs on relationships
Try to be more specific than “examine” or “compare”
Truth in Advertising: I didn’t have this worked out in 2010
75. Question 1B:
The Elements: Actions
What to do with the relationship? Comparison Actions (Verbs)
Does the model match my expectations? Measure/Quantify relationship
What documents are similar? Identify similar things
How do groups of documents differ? Measure/Quantify relationship
What words indicate these differences? Dissect a difference
How are words used differently? Identify meaningful differences
Where in texts are these differences? Contextualize the relationships
Do the patterns match other things I know? Identify similar things
76. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
77. Question 2:
Why is this comparison hard?
If it isn’t hard, you probably don’t need to think about it (much)
Abstractly
Too many targets to compare
Large or Complex Targets
Complex Relationships
Serendip
Challenges of Scale!
Lots of documents
Long / complex documents
Complicated models
78. Many different comparisons
Challenges from the kind (not scale)
• Hard target types (implicit)
• Hard action types (dissection)
• Hard combinations (dissect implicit)
Tasks influence scalability challenges
Solutions must respond to both!
Only Scalability Challenges?
79. Serendip Comparison Example:
Where does this happen? (contextualize)
Task Challenge:
Contextualize – fit user knowledge
Strategy: show in context
Design: use text as scaffold
Scalability Challenge:
Long Documents
Strategy: summarize
Design: overview + detail
81. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
82. Question 3:
What is your strategy for those challenges?
Abstractly
Scan Sequentially
Select Subset
Summarize Somehow
Sarikaya, Gleicher & Szafir. (2018). Design Factors
for Summary Visualization in Visual Analytics.
Computer Graphics Forum, 37(3), 145–156.
EuroVis 2018.
Scalability Strategies!
83. Serendip Comparison Example:
Compare Groups
Task Challenge:
Implicit targets – what groups?
Strategy: make explicit
Design: user specifies groups
Scalability Challenge:
Lots of documents
Strategy: summarize
Design: how to present statistics?
84. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
85. Question 4:
What Visual Design for Comparison?
Abstractly
Juxtaposition
Superposition
Explicit Encoding
Gleicher, M., Albers, D., Walker, R., Jusufi, I., Hansen, C. D., & Roberts, J. C. (2011).
Visual comparison for information visualization. Information Visualization,
10(4), 289–309.
86. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
88. If you’re so good at
comparison…
We should be able to compare complex things
89. Interactive Visual Comparison of
Object Embeddings
Florian Heimerl, Christoph Kralj,
Torsten Möller and Michael Gleicher
University of Wisconsin – Madison, University of Vienna
TVCG 2020 (presented at VIS ‘21)
90. The Problem:
Compare two embeddings
We are interested in the relationships between objects
local structure
Not their positions in space
global structure
91. back to the introductory example…
10 dimensional data mapped to 2 dimensions
867 objects
2 runs of t-SNE (different random seeds)
Perplexity 30, Learning Rate 200
Local structure (near neighbors) should be preserved
(that’s what the algorithm does)
94. Is the local structure really similar?
measure difference/similarity?
assess and localize it?
interpret and diagnose it?
identify exemplars?
understand context?
Metrics (multiple)
Summary views
Link between views
Connect to detail views
Connect back to global views
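One plausible way to measure whether local structure is similar (the talk uses several metrics; this is an illustrative sketch, not necessarily the paper's exact formulation) is the mean Jaccard overlap of each object's k-nearest-neighbor sets across the two embeddings.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i (excluding itself)."""
    d = {j: math.dist(points[i], p) for j, p in enumerate(points) if j != i}
    return set(sorted(d, key=d.get)[:k])

def neighbor_overlap(emb_a, emb_b, k=2):
    """Mean Jaccard overlap of k-NN sets for the same objects in two
    embeddings: 1.0 means identical local structure, 0.0 means none shared."""
    scores = []
    for i in range(len(emb_a)):
        a, b = knn(emb_a, i, k), knn(emb_b, i, k)
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores)

# An embedding compared with itself preserves all neighborhoods.
pts = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5)]
score = neighbor_overlap(pts, pts)
```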
108. A real example:
Word Vector Embeddings, 2 Corpora
Wikipedia (modern) vs EEBO (1470-1700)
GloVe (embedding algorithm)
300 dims (key parameter of algorithm)
5000 most common words (can’t show all words)
115. Another example:
Build topic models with abstracts?
Vispub data (871 papers)
Abstracts vs. Full Text
NMF, 10 topics
Hypothesis: Abstracts are different than papers
116.
117. Dimensionality Reduction of Topic Models
(abstracts vs. texts)
Red dots are the “topics” – one-hot vectors for each topic (the unit dimensions of the space)
121. Limitations
Binary comparison – less good for parameter tuning
need new designs?
Many complex views that need to be combined
address usability – through task-driven pre-arrangements?
Scalability of implementation and visual designs
Validation on real problems with real users
122. Summary
Embeddings for Text Analysis
If we can design interpretation tools
Compare document and word
embeddings to interpret them
Specialized tools for comparison
of embeddings
Task-centric design process
Comparison as Analysis Approach
Because we have a design process
An approach to thinking about
design for comparison
Examples using text analysis with
embedding comparison
4 considerations of comparison
123. Thanks!
To you for listening
To my students and collaborators
(too many to list)
To our sponsors
NSF, NIH, DARPA, Mellon Foundation
Michael Gleicher
http://pages.cs.wisc.edu/~gleicher
Editor's Notes
I presented the basic framework in my class in 2010 – before phones had panoramic photography.
Danielle, who later built Sequence Surveyor, was there.
I mean comparison in its broad, standard dictionary sense: the act of relating two or more things
To think about comparison in an abstract way, I have a set of four questions.
For each question, I have an abstraction of the space of answers – in the form categories for the answers.
This slide is the main piece of the work. But it’s a little much to take in abstractly, so let me introduce it with an example.
The idea of Serendip is that it was a tool for viewing topic models and using them to gain insights on the text collections that they were built from.
It was a system that was built in collaboration with humanities scholars and had 3 main views.
But, you should be asking if you should try thinking this way.