This document discusses approaches for visualizing comparisons of word embeddings. It begins by describing the challenges of interpreting word embeddings, which reside in high-dimensional spaces with complex relationships between words. It then reviews common tasks for analyzing word embeddings, such as identifying similar words and semantic directions. Existing visualization techniques are limited in supporting these tasks. The document proposes three new visualization designs: Buddy Plots to show local word similarities, Concept Axis Plots to reveal semantic directions between words, and Co-occurrence Matrices to display predictive relationships. These designs aim to help interpret word embeddings and guide linguistic analysis.
Models such as latent semantic analysis and those based on neural embeddings learn distributed representations of text, and match the query against the document in the latent semantic space. In traditional information retrieval models, on the other hand, terms have discrete or local representations, and the relevance of a document is determined by the exact matches of query terms in the body text. We hypothesize that matching with distributed representations complements matching with traditional local representations, and that a combination of the two is favourable. We propose a novel document ranking model composed of two separate deep neural networks, one that matches the query and the document using a local representation, and another that matches the query and the document using learned distributed representations. The two networks are jointly trained as part of a single neural network. We show that this combination or ‘duet’ performs significantly better than either neural network individually on a Web page ranking task, and significantly outperforms traditional baselines and other recently proposed models based on neural networks.
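As a rough illustration of the idea (not the actual duet architecture, which jointly trains two deep networks), one can combine a local exact-match signal with a distributed embedding signal; the vocabulary, the random vectors, and the mixing weight below are invented for the sketch:

```python
import numpy as np

# Toy vocabulary and stand-in "learned" word vectors; in the real duet
# model both sub-networks are deep and trained jointly on relevance labels.
vocab = {"cheap": 0, "flights": 1, "low": 2, "cost": 3, "airfare": 4}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8))

def local_score(query, doc):
    """Local (lexical) signal: fraction of query terms matched exactly."""
    q, d = set(query), set(doc)
    return len(q & d) / len(q)

def distributed_score(query, doc):
    """Distributed signal: cosine between mean query and doc vectors."""
    qv = embeddings[[vocab[t] for t in query]].mean(axis=0)
    dv = embeddings[[vocab[t] for t in doc]].mean(axis=0)
    return float(qv @ dv / (np.linalg.norm(qv) * np.linalg.norm(dv)))

def duet_score(query, doc, alpha=0.5):
    """Duet-style combination: weighted sum of both matching signals."""
    return alpha * local_score(query, doc) + (1 - alpha) * distributed_score(query, doc)

query = ["cheap", "flights"]
doc = ["low", "cost", "airfare", "flights"]
print(duet_score(query, doc))
```

The point of the combination is visible even in this toy: the local score alone misses "low cost airfare" as a match for "cheap", while the distributed score alone ignores the exact match on "flights".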
5 Lessons Learned from Designing Neural Models for Information Retrieval - Bhaskar Mitra
Slides from my keynote talk at the Recherche d'Information SEmantique (RISE) workshop at CORIA-TALN 2018 conference in Rennes, France.
(Abstract)
Neural Information Retrieval (or neural IR) is the application of shallow or deep neural networks to IR tasks. Unlike classical IR models, these machine learning (ML) based approaches are data-hungry, requiring large scale training data before they can be deployed. Traditional learning to rank models employ supervised ML techniques—including neural networks—over hand-crafted IR features. By contrast, more recently proposed neural models learn representations of language from raw text that can bridge the gap between the query and the document vocabulary.
Neural IR is an emerging field, and research publications in the area have been increasing in recent years. While the community explores new architectures and training regimes, a new set of challenges, opportunities, and design principles is emerging in the context of these new IR models. In this talk, I will share five lessons learned from my personal research in the area of neural IR. I will present a framework for discussing different unsupervised approaches to learning latent representations of text. I will cover several challenges to learning effective text representations for IR and discuss how latent space models should be combined with observed feature spaces for better retrieval performance. Finally, I will conclude with a few case studies that demonstrate the application of neural approaches to IR that go beyond text matching.
Evaluating Topic Coherence Using Distributional Semantics - pathsproject
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Using Distributional Semantics", Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) -- Long Papers, Potsdam, Germany
Neural Models for Information Retrieval - Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models will also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks, leading to new challenges and opportunities for existing neural representation learning approaches for text.
We begin this talk with a discussion on text embedding spaces for modelling different types of relationships between items which makes them suitable for different IR tasks. Next, we present how topic-specific representations can be more effective than learning global embeddings. Finally, we conclude with an emphasis on dealing with rare terms and concepts for IR, and how embedding based approaches can be augmented with neural models for lexical matching for better retrieval performance. While our discussions are grounded in IR tasks, the findings and the insights covered during this talk should be generally applicable to other NLP and machine learning tasks.
Vectorland: Brief Notes from Using Text Embeddings for Search - Bhaskar Mitra
(Invited talk at Search Solutions 2015)
A lot of recent work in neural models and “Deep Learning” is focused on learning vector representations for text, image, speech, entities, and other nuggets of information. From word analogies to automatically generating human level descriptions of images, the use of text embeddings has become a key ingredient in many natural language processing (NLP) and information retrieval (IR) tasks.
In this talk, I will present some personal learnings from working on (neural and non-neural) text embeddings for IR, as well as highlight a few key recent insights from the broader academic community. I will talk about the affinity of certain embeddings for certain kinds of tasks, and how the notion of relatedness in an embedding space depends on how the vector representations are trained. The goal of this talk is to encourage everyone to start thinking about text embeddings beyond just as an output of a “black box” machine learning model, and to highlight that the relationships between different embedding spaces are about as interesting as the relationships between items within an embedding space.
Neural Models for Information Retrieval - Bhaskar Mitra
In the last few years, neural representation learning approaches have achieved very good performance on many natural language processing (NLP) tasks, such as language modelling and machine translation. This suggests that neural models may also yield significant performance improvements on information retrieval (IR) tasks, such as relevance ranking, addressing the query-document vocabulary mismatch problem by using semantic rather than lexical matching. IR tasks, however, are fundamentally different from NLP tasks, leading to new challenges and opportunities for existing neural representation learning approaches for text.
In this talk, I will present my recent work on neural IR models. We begin with a discussion on learning good representations of text for retrieval. I will present visual intuitions about how different embedding spaces capture different relationships between items, and their usefulness to different types of IR tasks. The second part of this talk is focused on the applications of deep neural architectures to the document ranking task.
As the volume of content continues to grow exponentially, helping search engines understand the context and topical themes within your site is increasingly important. This deck covers some of these concepts, as well as ways to utilise them in your marketing strategy.
Towards a Distributional Semantic Web Stack - Andre Freitas
The ability of distributional semantic models (DSMs) to discover similarities over large-scale heterogeneous and poorly structured data makes them a promising universal and low-effort framework to support semantic approximation and knowledge discovery. This position paper explores the role of distributional semantics in the Semantic Web vision, based on state-of-the-art distributional-relational models, categorizing and generalizing existing approaches into a Distributional Semantic Web stack.
Continuous bag of words (cbow) word2vec word embedding work.pdf - devangmittal4
The way the continuous bag of words (CBOW) word2vec embedding works is that it predicts the probability of a word given a context. A context may be a single word or a group of words. For simplicity, I will take a single context word and try to predict a single target word.
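The single-context-word setup described above can be sketched with a minimal numpy implementation; the toy sentence, embedding dimension, learning rate, and epoch count are assumptions for illustration:

```python
import numpy as np

# Minimal CBOW with one context word predicting one target word.
corpus = ["a word is characterized by the company it keeps".split()]
vocab = sorted({w for sent in corpus for w in sent})
w2i = {w: i for i, w in enumerate(vocab)}
V, N = len(vocab), 5          # vocabulary size, embedding dimension

rng = np.random.default_rng(42)
W_in = rng.normal(scale=0.1, size=(V, N))   # input (embedding) matrix
W_out = rng.normal(scale=0.1, size=(N, V))  # output matrix

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_pair(context, target, lr=0.1):
    """One SGD step: predict `target` from a single `context` word."""
    global W_in, W_out
    h = W_in[w2i[context]]           # hidden layer = context embedding
    p = softmax(h @ W_out)           # predicted distribution over vocab
    err = p.copy()
    err[w2i[target]] -= 1.0          # dL/dz for softmax + cross-entropy
    grad_in = W_out @ err            # compute before updating W_out
    W_out -= lr * np.outer(h, err)
    W_in[w2i[context]] -= lr * grad_in

# Train on (context, target) pairs from a window of size 1.
for _ in range(200):
    for sent in corpus:
        for i in range(len(sent) - 1):
            train_pair(sent[i], sent[i + 1])
            train_pair(sent[i + 1], sent[i])

# After training, p("word" | "a") should sit well above chance (1/V),
# since "word" is the only target that ever follows the context "a".
p = softmax(W_in[w2i["a"]] @ W_out)
print(p[w2i["word"]])
```

After training, the rows of `W_in` are the word embeddings; real toolkits such as word2vec replace the full softmax with negative sampling or a hierarchical softmax to scale to large vocabularies.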
The purpose of this question is to create a word embedding for the given data set.
data set text:
In linguistics, word embeddings were discussed in the research area of distributional semantics, which aims to quantify and categorize semantic similarities between linguistic items based on their distributional properties in large samples of language data. The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth.
The technique of representing words as vectors has roots in the 1960s with the development of the vector space model for information retrieval. Reducing the number of dimensions using singular value decomposition then led to the introduction of latent semantic analysis in the late 1980s. In 2000, Bengio et al. provided in a series of papers the "Neural probabilistic language models" to reduce the high dimensionality of word representations in contexts by "learning a distributed representation for words" (Bengio et al., 2003). Word embeddings come in two different styles: one in which words are expressed as vectors of co-occurring words, and another in which words are expressed as vectors of the linguistic contexts in which the words occur; these different styles are studied in (Lavelli et al., 2004). Roweis and Saul published in Science how to use "locally linear embedding" (LLE) to discover representations of high-dimensional data structures.
The area developed gradually and really took off after 2010, partly because important advances had been made since then on the quality of vectors and the training speed of the model.
There are many branches and many research groups working on word embeddings. In 2013, a team at Google led by Tomas Mikolov created word2vec, a word embedding toolkit which can train vector space models faster than previous approaches. Most new word embedding techniques rely on a neural network architecture instead of more traditional n-gram models and unsupervised learning.
Limitations
One of the main limitations of word embeddings (word vector space models in general) is that
possible meanings of a word are conflated into a single representation (a single vector in the
semantic space). Sense embeddings are a solution to this problem: individual meanings of words
are represented as distinct vectors in the space.
For biological sequences: BioVectors
Word embeddings for n-grams in biological sequences (e.g. DNA, RNA, and proteins) for bioinformatics applications have been proposed by Asgari and Mofrad, who named this representation bio-vectors (BioVec) for biological sequences in general, with protein-vectors (ProtVec) for proteins (amino-acid sequences) and gene-vectors (GeneVec) for gene sequences.
Topic Modeling for Information Retrieval and Word Sense Disambiguation tasks - Leonardo Di Donato
Experimental work done regarding the use of Topic Modeling for the implementation and the improvement of some common tasks of Information Retrieval and Word Sense Disambiguation.
First of all, it describes the scenario, the pre-processing pipeline realized, and the framework used. Then we discuss the investigation of different hyperparameter configurations for the LDA algorithm.
This work continues by dealing with the retrieval of relevant documents, mainly through two different approaches: inferring the topic distribution of the held-out document (or query) and comparing it against the collection to retrieve similar documents, or through an approach driven by probabilistic querying. The last part of this work is devoted to the investigation of the word sense disambiguation task.
Semantic Based Model for Text Document Clustering with Idioms - Waqas Tariq
Text document clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data available in various forms in online forums such as the web, social networks, and other information networks. Clustering is a very powerful data mining technique for organizing the large amount of information on the web. Traditionally, document clustering methods do not consider the semantic structure of the document. This paper addresses the task of developing an effective and efficient method to improve the semantic structure of text documents. A method has been developed that performs the following: tagging the documents for parsing, replacement of idioms with their original meaning, calculation of semantic weights for document words, and application of semantic grammar. A similarity measure is obtained between the documents, and the documents are then clustered using a hierarchical clustering algorithm. The method is evaluated on different data sets with standard performance measures, and its effectiveness in producing meaningful clusters has been demonstrated.
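The final steps described above (a pairwise similarity measure followed by hierarchical clustering) can be sketched as follows; the documents, the plain bag-of-words weighting, and the cluster count are illustrative assumptions, not the paper's semantic-weighting method:

```python
import numpy as np

# Toy documents: two finance-themed, two football-themed.
docs = [
    "stock market prices fell sharply",
    "shares and market indices dropped",
    "the striker scored a late goal",
    "the team won the football match",
]
terms = sorted({w for d in docs for w in d.split()})
t2i = {t: i for i, t in enumerate(terms)}

# Bag-of-words vectors (the paper would use semantic weights instead).
X = np.zeros((len(docs), len(terms)))
for i, d in enumerate(docs):
    for w in d.split():
        X[i, t2i[w]] += 1

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hierarchical_cluster(X, k):
    """Naive single-linkage agglomerative clustering down to k clusters."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best, pair = -1.0, None
        # Merge the two clusters whose closest members are most similar.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(cosine(X[i], X[j])
                          for i in clusters[a] for j in clusters[b])
                if sim > best:
                    best, pair = sim, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)
    return clusters

print(hierarchical_cluster(X, 2))
```

Even with raw term counts, the two finance documents and the two football documents end up in separate clusters; semantic weighting and idiom replacement would sharpen these similarities further.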
{Ontology: Resource} x {Matching : Mapping} x {Schema : Instance} :: Compone... - Amit Sheth
Invited Talk, International Workshop on Ontology Matching
collocated with the 5th International Semantic Web Conference
ISWC-2006, November 5, 2006, Athens GA
Latent Semantic Analysis (LSA) is a mathematical technique for computationally modeling the meaning of words and larger units of text. LSA works by applying a mathematical technique called Singular Value Decomposition (SVD) to a term-document matrix containing frequency counts for all words found in all of the documents or passages in the corpus. After this SVD application, the meaning of a word is represented as a vector in a multidimensional semantic space, which makes it possible to compare word meanings, for instance by computing the cosine between two word vectors.
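The pipeline just described (term-document counts, SVD, cosine comparison) can be sketched as follows; the tiny corpus and the choice of k retained dimensions are illustrative assumptions:

```python
import numpy as np

# A miniature corpus standing in for the documents or passages of a real one.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
terms = sorted({w for d in docs for w in d.split()})
t2i = {t: i for i, t in enumerate(terms)}

# Term-document matrix of raw frequency counts.
X = np.zeros((len(terms), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        X[t2i[w], j] += 1

# Truncated SVD: keep only the k largest singular values, which projects
# word meanings into a low-dimensional semantic space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
word_vecs = U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare two word meanings by the cosine between their vectors.
print(cosine(word_vecs[t2i["cat"]], word_vecs[t2i["dog"]]))
```

In a realistic setting the counts are usually reweighted (e.g. with tf-idf or log-entropy) before the SVD, and k is typically in the hundreds.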
LSA has been successfully used in a large variety of language related applications from automatic grading of student essays to predicting click trails in website navigation. In Coh-Metrix (Graesser et al. 2004), a computational tool that produces indices of the linguistic and discourse representations of a text, LSA was used as a measure of text cohesion by assuming that cohesion increases as a function of higher cosine scores between adjacent sentences.
Besides being interesting as a technique for building programs that need to deal with semantics, LSA is also interesting as a model of human cognition: it can match human performance on word association tasks and vocabulary tests. In this talk, Fridolin will focus on LSA as a tool for modeling language acquisition. After framing the area of the talk by sketching the key concepts of learning, information, and competence acquisition, and after outlining presuppositions, an introduction to meaningful interaction analysis (MIA) is given. MIA is a means to inspect learning with the support of language analysis that is geometrical in nature: a fusion of latent semantic analysis (LSA) with network analysis (NA/SNA). LSA, NA/SNA, and MIA are illustrated by several examples.
AN EMPIRICAL STUDY OF WORD SENSE DISAMBIGUATION - ijnlc
Word Sense Disambiguation (WSD) is an important area which has an impact on improving the performance of computational linguistics applications such as machine translation, information retrieval, text summarization, question answering systems, etc. We present a brief history of WSD and discuss the supervised, unsupervised, and knowledge-based approaches to it. Though many WSD algorithms exist, we consider optimal and portable WSD algorithms most appropriate, since they can be embedded easily in applications of computational linguistics. This paper also provides an overview of some WSD algorithms and their performance, comparing them and assessing the need for word sense disambiguation.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
2. Two talks (or five) in one!
How do we deal with
embeddings of text data?
Alexander, et al. Serendip: Topic Model-Driven Visual Exploration of Text Corpora. VAST ’14.
Alexander & Gleicher. Task-Driven Comparison of Topic Models. VAST ’15.
Heimerl & Gleicher. Interactive Analysis of Word Vector Embeddings. EuroVis ’18.
Heimerl, et al. Interactive Visual Comparison of Object Embeddings. TVCG ’21.
How do we use comparison as a
tool for hard data problems?
Gleicher. Considerations for Visualizing
Comparison. InfoVis 2017.
3. How do we deal with embeddings
of text data?
Alexander, et al. Serendip: Topic Model-Driven Visual Exploration of Text Corpora. VAST ’14.
Alexander & Gleicher. Task-Driven Comparison of Topic Models. VAST ’15.
Heimerl & Gleicher. Interactive Analysis of Word Vector Embeddings. EuroVis ’18.
Heimerl, et al. Interactive Visual Comparison of Object Embeddings. TVCG ’21.
Eric Alexander
Assistant Prof.
Carleton College
Florian Heimerl
Post-Doc
University of Wisconsin
4.
5. Michael Gleicher
University of Wisconsin
Visual Computing Group
Human Graphics Interaction
authoring pictures, videos, animations
Human Robot Interaction
robots!
Human Data Interaction
visualization, visual analytics, interactive learning
8. Two different stories…
Embeddings are a great way to
analyze text collections!
But they are hard to interpret!
Use comparison as a strategy
Comparison is a great way to
think about analysis problems!
But it’s abstract – need examples!
Use embeddings as a case study
10. What is an embedding?
General mathematics:
Place a smaller structure into a
larger structure
Computer science:
Place a discrete set of objects into
a vector space
Encode relationships between
objects
[Figure: example objects (Paris, Atlanta, New York, London, Boston, Tokyo, Beijing, San Jose, Jakarta, Sydney, Munich), each paired with a vector, e.g. Paris → (4, 8, 1, 3, …)]
High Dimensional Data
Objects have associated Vectors
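The computer-science sense of an embedding can be sketched in a few lines of Python. The city vectors below are made-up toy values echoing the slide's illustration, not the output of any real embedding method; cosine similarity is one common way to compare the associated vectors.

```python
import math

# Toy embedding: each object has an associated vector.
# The numbers are illustrative, not from a real model.
embedding = {
    "Paris":    [4.0, 8.0, 1.0, 3.0],
    "Atlanta":  [5.0, 2.0, 1.0, 7.0],
    "New York": [4.0, 8.0, 1.0, 3.0],
    "Tokyo":    [9.0, 2.0, 6.0, 4.0],
}

def cosine(u, v):
    """Cosine similarity: compares directions, ignoring vector length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def nearest(obj):
    """All other objects, ranked by similarity to `obj`."""
    return sorted((o for o in embedding if o != obj),
                  key=lambda o: cosine(embedding[obj], embedding[o]),
                  reverse=True)

print(nearest("Paris"))  # "New York" first: it has an identical toy vector
```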
11. Kinds of relationships in embeddings
Distance: A is closer to B than to C
Linear Structure: A is to B as C is to D
Semantic Directions: A is more X than C
[Figure: labeled points A–F illustrating each relationship type]
Relationships are interesting even if global positions are not
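All three relationship types reduce to simple vector arithmetic. A minimal sketch with hand-picked 2-D vectors (a toy layout, not a trained embedding; real spaces have hundreds of dimensions):

```python
# Hand-picked 2-D toy vectors; real embeddings are high-dimensional,
# but the arithmetic on relationships is the same.
V = {
    "king":  [2.0, 3.0],
    "queen": [2.0, 1.0],
    "man":   [1.0, 3.0],
    "woman": [1.0, 1.0],
}

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Distance: "A is closer to B than to C"
assert dist(V["king"], V["man"]) < dist(V["king"], V["woman"])

# Linear structure: "A is to B as C is to D" -- the offsets match
assert sub(V["king"], V["man"]) == sub(V["queen"], V["woman"])

# Semantic direction: a shared offset acts as an axis ("A is more X than C")
gender_axis = sub(V["woman"], V["man"])
assert sub(V["queen"], V["king"]) == gender_axis
```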
12. Embeddings
We care about relationships not values
The coordinates (axes) may have no meaning
15. Embeddings for analyzing Text Corpora
Embed Documents
Similar documents are close
Estimate similar meaning by
word usage statistics in a corpus
Embed Words
Similar words are close
Estimate similar meaning by
word usage statistics in a corpus
16. Document Similarity
Documents that use similar words
(probably) have similar content
Idea from the 1960s
https://www.nytimes.com/2019/01/02/obituaries/karen-sparck-jones-overlooked.html
17. Similarity based on word counts?
Need to deal with similar words (synonyms)
Need to reduce the number of words (too high dimensional)
Find groups of words
Pre-defined dictionaries (word types)
word usage analysis
Statistical groupings (words that tend to go together)
Topic Modeling
19. Word vector embeddings
Place words in a high-dimensional vector space
Words similar in meaning should be close in space
Infer similarity by distributional semantics:
similar context implies similar meaning
Construct embeddings by processing a corpus of text
my pet cat is brown
my pet dog is brown
my big car is brown
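The distributional-semantics idea can be made concrete with the slide's own three sentences: "cat" and "dog" occur in identical contexts, so context-count vectors place them close together. A stdlib-only sketch; the window size and overlap measure are arbitrary choices for illustration.

```python
from collections import Counter

# The slide's toy corpus: "cat" and "dog" share identical contexts,
# while "car" shares only part of its context with them.
corpus = [
    "my pet cat is brown",
    "my pet dog is brown",
    "my big car is brown",
]

def context_vector(target, window=2):
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(words[lo:i] + words[i + 1:hi])
    return counts

def overlap(a, b):
    """Crude similarity: shared context mass between two count vectors."""
    return sum(min(a[w], b[w]) for w in a)

cat, dog, car = (context_vector(w) for w in ("cat", "dog", "car"))
assert overlap(cat, dog) > overlap(cat, car)  # similar context ~ similar meaning
```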
20. Why use Word Vector Embeddings?
Learn about
Language or Corpora (Texts)
Find similar words/synonyms
Track changes of word usage
Exploring polysemy
Creating lexical resources
Evidence of bias
…
Natural Language Applications
Pre-Processing
Translation
Sentiment Analysis
Interpretation
…
21. Several ways to build word embeddings
Word2Vec
Skip-gram model
Neural embedding
GloVe
Co-occurrence model
Factor matrix by optimization
Several ways to build topic models
(document embeddings)
LDA (Latent Dirichlet Allocation)
Standard algorithm
Iterative and probabilistic
NMF (Non-Negative Matrix Factors)
Fast Algorithm
Less widely adopted
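As a hedged illustration of the document-embedding side, here is a minimal NMF topic model built with scikit-learn. The four-document corpus and topic count are invented for the example; the slide's other methods (Word2Vec, GloVe, LDA) would be called through their own libraries in an analogous way.

```python
# Minimal NMF topic-model sketch (one of the two document-embedding
# methods the slide lists); corpus and topic count are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF

docs = [
    "word vectors embed words in a vector space",
    "embedding vectors place words near similar words",
    "topic models summarize documents by topics",
    "documents share topics in a topic model",
]

X = CountVectorizer(stop_words="english").fit_transform(docs)
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
doc_topics = nmf.fit_transform(X)   # each document embedded in topic space
print(doc_topics.shape)
```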
22. Why Interpret Embeddings?
Gain insight into a model
Did you build a good model?
Gain insight into the modeling process
Which methods are better?
Gain insight about the underlying data
What does the model tell us?
23. Challenges of Word Vector Embeddings
Challenges of Document Embeddings
Large numbers of Words (or documents)
High-dimensional spaces
Complex relationships – meaningless positions (or questionable)
Complex processes for building embeddings
Complex downstream applications
No ground truth - subjective aspects
24. Interactive Analysis of Word Vector Embeddings
Florian Heimerl and Michael Gleicher
Department of Computer Sciences
University of Wisconsin – Madison
EuroVis 2018
25. Summary:
Word vector embeddings offer unique challenges
Task analysis of needs
3 designs for unmet needs:
Buddy Plots
Concept Axis Plots
Co-occurrence Matrices
26. Task Analysis:
What do people do with Word Embeddings?
Literature survey
111 papers from diverse communities
Consider use cases in Linguistics, HCI, Digital Humanities, etc.
Augment this list by extrapolation:
what tasks would users want, but aren’t doing yet?
27. Use cases suggest tasks
Learn about
Language or Corpora (Texts)
Find similar words/synonyms
Track changes of word usage
Exploring polysemy
Creating lexical resources
Natural Language Applications
Pre-Processing
Translation
Sentiment Analysis
Interpretation
Evaluation
Intrinsic (good embedding?)
Extrinsic (applications success?)
Interpretation
Identify items of interest
Probe values of interest
28. Linguistic Tasks and Characteristics
We identified 7 distinct linguistic tasks within the literature
4 characteristics pertinent to those tasks: similarity, arithmetic
structures, concept axis, and co-occurrences
30. Tasks and Characteristics vs.Visualizations
Rank word pairs: similarity
View neighbors: similarity
Select synonyms: similarity
Compare concepts: average, similarity
Find analogies: offset, similarity
Project on Concept: concept axis
Predict Contexts: co-occurrence probability
Smilkov, et al 2016
Liu, et al 2017
31. Tasks and Characteristics vs.Visualizations
Rank word pairs: similarity
View neighbors: similarity
Select synonyms: similarity
Compare concepts: average, similarity
Find analogies: offset, similarity
Project on Concept: concept axis
Predict Contexts: co-occurrence probability
Smilkov, et al 2016
Liu, et al 2017
Tasks we seek designs for in this paper
36. Ways to define axes
Vector between two concepts
InterAxis – Kim et al., 2015
Classifier between two groups
Explainers – Gleicher, 2013
[Figure: words (Dog, Cat, Goldfish, Chicken, Cow, Trout) projected onto a concept axis running from Food to Pet]
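A sketch of the first option, defining an axis as the vector between two concepts. All vectors are toy values invented for the example; `project` returns the scalar coordinate of a word along the axis (0 at the first concept, 1 at the second).

```python
# Concept-axis sketch: the axis is the vector from "food" to "pet";
# other words are projected onto it. Toy vectors, not a real embedding.
V = {
    "food":     [0.0, 0.0],
    "pet":      [4.0, 0.0],
    "dog":      [3.5, 1.0],
    "cat":      [3.3, 0.8],
    "chicken":  [1.0, 0.5],
    "cow":      [0.8, 1.2],
    "goldfish": [2.0, 0.3],
}

def project(word, a="food", b="pet"):
    """Scalar position of `word` along the a->b concept axis (0 at a, 1 at b)."""
    axis = [q - p for p, q in zip(V[a], V[b])]
    rel  = [q - p for p, q in zip(V[a], V[word])]
    dot  = sum(x * y for x, y in zip(axis, rel))
    return dot / sum(x * x for x in axis)

# In this toy layout, dogs read as more "pet" than chickens.
assert project("dog") > project("chicken")
```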
40. Similarities:
Understanding local distances
Distances are meaningful
even if absolute values are not
What is close to a word?
Are there groups of words that are similar?
Ordered lists are useful
Density (how many can you show)
Sense of relative distances
Comparison between words
Embedding Projector
Smilkov, et al. 2016
41. Buddy Plots (1D lists)
Alexander and Gleicher, 2016 – for Topic Models
Map distance (to selected reference) to horizontal axis
[Figure: a Buddy Plot; the reference object anchors the left end of the axis, with the words closest to it placed nearest the reference]
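A minimal reading of the Buddy Plot mapping as described on the slide: each word's horizontal position is its distance to a chosen reference word, yielding an ordered 1-D list of neighbors. The vectors are toy values, and the full design includes encodings not sketched here.

```python
# Buddy-plot sketch: map distance-to-reference onto the horizontal axis.
V = {
    "cat": [1.0, 1.0], "dog": [1.2, 1.0],
    "pet": [1.5, 1.4], "car": [4.0, 0.2],
}

def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def buddy_positions(reference):
    """(distance, word) pairs for every other word, closest first."""
    return sorted((dist(V[reference], V[w]), w) for w in V if w != reference)

for x, word in buddy_positions("cat"):
    print(f"{x:5.2f}  {word}")   # closest buddy of the reference prints first
```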
49. Comparison?
For interpretation
Do the differences show
something?
Word meaning change
(between corpora)
Correlations / Biases
(within corpora)
For modeling (selection)
How are these models different?
Which is better?
50. Comparison?
Comparison is an important task
in Data Analysis
Comparison is special!
(since it involves multiple things)
It’s an important special case
It deserves special attention
Almost all Data Analysis can be
viewed as comparison
Comparison is a lens
(to look at problems)
It’s a generally useful tool
It deserves special attention
51. How do I think about
comparison?
to help me develop tools to help people do it
55. What is this paper?
Considerations for Visualizing Comparison
4 questions to ask when designing a visualization or tool
56. To examine (two or more
objects, ideas, people, etc.) in
order to note similarities and
differences
To mark or point out the
similarities and differences of
(two or more things)
57. What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
58. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
59. Wrong Question:
Is my problem Comparison?
Just about anything can be viewed as comparison
Not everything benefits from being viewed this way
60. Serendip, VAST ‘14
Alexander, E., Kohlmann, J., Valenza, R., Witmore, M., &
Gleicher, M. (2014). Serendip: Topic model-driven visual
exploration of text corpora. In 2014 IEEE Conference on
Visual Analytics Science and Technology (VAST)
61. Serendip: Topic model-driven visual
exploration of text corpora
Use a topic model to guide exploration of a text corpus
Find patterns and connect back to specifics
Documents → Model → Documents/Passages
63. Goals
Support inquiry across levels of abstraction
Combat issues of scale in the data
Many documents; long documents
64. Goals
Support inquiry across levels of abstraction
Combat issues of scale
Promote serendipitous discovery:
Multiple entry points
Highlight adjacency
Flexible exploration
A. Thudt, U. Hinrichs, S. Carpendale 2012
65. Basics of Serendip
Three different (interlinked) views
Corpus / Document / Passage / Word
CorpusViewer, TextViewer, RankViewer
66.
67. Is this comparison?
No!
A tool for exploring a topic
model!
We didn’t describe it as comparison
Tool for looking at one topic model
Unclear how users think about it
Yes!
Comparison thinking really
helped!
We did think about comparison
Tool for using topic models
Our users had comparison tasks
68. Is this comparison? I don’t care!
No!
A tool for exploring a topic model!
We didn’t describe it as comparison
Tool for looking at one topic model
Unclear how users think about it
Yes!
Comparison thinking really helped!
We did think about comparison
Tool for using topic models
Our users had comparison tasks
A survey of comparison
would have missed this.
It’s a great example of
comparison ideas
69. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
70. Question 1:
What is the comparison?
What are the elements of the comparison?
71. The Elements of a Comparison . . .
Targets — Set of things being compared
Action — What to do with the relationship among them
To mark or point out the
similarities and differences of
(two or more things)
To examine
(two or more objects, ideas, etc.)
in order to note
similarities and differences
72. Question 1A:
The Elements: Targets
Do you know what you are comparing?
Explicit Comparisons – the system has the set of targets
Implicit Comparisons – the system may not know all the targets
compare against an implicit baseline
compare against the user’s knowledge
compare with targets only the user knows
73. Question 1A:
The Elements: Targets
What is being compared? – Comparison Targets
Does the model match my expectations?
What documents are similar?
How do groups of documents differ?
What words indicate these differences?
How are words used differently?
Where in texts are these differences?
Do the patterns match other things I know?
74. Question 1B:
The Elements: Actions
Verbs on relationships
Try to be more specific than “examine” or “compare”
Truth in Advertising: I didn’t have this worked out in 2010
75. Question 1B:
The Elements: Actions
What to do with the relationship? Comparison Actions (Verbs)
Does the model match my expectations? Measure/Quantify relationship
What documents are similar? Identify similar things
How do groups of documents differ? Measure/Quantify relationship
What words indicate these differences? Dissect a difference
How are words used differently? Identify meaningful differences
Where in texts are these differences? Contextualize the relationships
Do the patterns match other things I know? Identify similar things
76. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
77. Question 2:
Why is this comparison hard?
If it isn’t hard, you probably don’t need to think about it (much)
Abstractly
Too many targets to compare
Large or Complex Targets
Complex Relationships
Serendip
Challenges of Scale!
Lots of documents
Long / complex documents
Complicated models
78. Many different comparisons
Challenges from the kind (not scale)
• Hard target types (implicit)
• Hard action types (dissection)
• Hard combinations (dissect implicit)
Tasks influence scalability challenges
Solutions must respond to both!
Only Scalability Challenges?
79. Serendip Comparison Example:
Where does this happen? (contextualize)
Task Challenge:
Contextualize – fit user knowledge
Strategy: show in context
Design: use text as scaffold
Scalability Challenge:
Long Documents
Strategy: summarize
Design: overview + detail
81. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
82. Question 3:
What is your strategy for those challenges?
Abstractly
Scan Sequentially
Select Subset
Summarize Somehow
Sarikaya, Gleicher & Szafir. (2018). Design Factors
for Summary Visualization in Visual Analytics.
Computer Graphics Forum, 37(3), 145–156.
EuroVis 2018.
Scalability Strategies!
83. Serendip Comparison Example:
Compare Groups
Task Challenge:
Implicit targets – what groups?
Strategy: make explicit
Design: user specifies groups
Scalability Challenge:
Lots of documents
Strategy: summarize
Design: how to present statistics?
84. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
85. Question 4:
What Visual Design for Comparison?
Abstractly
Juxtaposition
Superposition
Explicit Encoding
Gleicher, M., Albers, D., Walker, R., Jusufi, I., Hansen, C. D., & Roberts, J. C. (2011).
Visual comparison for information visualization. Information Visualization,
10(4), 289–309.
86. Comparative Elements
Targets
Actions
Comparative Challenges
Number of Targets
Large or Complex Targets
Complex Relationships
Scalability Strategies
Scan Sequentially
Select Subset
Summarize Somehow
Comparative Designs
Juxtapose
Superpose
Explicit Encoding
What is the comparison? Why is it hard?
How to address the challenges? Which visual design to use?
88. If you’re so good at
comparison…
We should be able to compare complex things
89. Interactive Visual Comparison of
Object Embeddings
Florian Heimerl, Christoph Kralj,
Torsten Möller and Michael Gleicher
University of Wisconsin – Madison, University of Vienna
TVCG 2020 (presented at VIS ‘21)
90. The Problem:
Compare two embeddings
We are interested in the relationships between objects
local structure
Not their positions in space
global structure
91. back to the introductory example…
10 dimensional data mapped to 2 dimensions
867 objects
2 runs of t-SNE (different random seeds)
Perplexity 30, Learning Rate 200
Local structure (near neighbors) should be preserved
(that’s what the algorithm does)
94. Is the local structure really similar?
measure difference/similarity?
assess and localize it?
interpret and diagnose it?
identify exemplars?
understand context?
Metrics (multiple)
Summary views
Link between views
Connect to detail views
Connect back to global views
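One way to make "measure difference/similarity" concrete, under the assumption that k-nearest-neighbor overlap between the two runs is an acceptable proxy for local-structure preservation (the paper may use different metrics). The data here is random rather than the 867-object example, but the t-SNE parameters follow the slide.

```python
# Do two t-SNE runs (different random seeds) keep the same near neighbors?
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))          # 10-dimensional objects

def knn_sets(Y, k=10):
    """The k nearest neighbors of every point in layout Y."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(Y)
    idx = nn.kneighbors(Y, return_distance=False)[:, 1:]  # drop self
    return [set(row) for row in idx]

embeds = [TSNE(n_components=2, perplexity=30, learning_rate=200,
               random_state=seed).fit_transform(X) for seed in (1, 2)]

a, b = (knn_sets(Y) for Y in embeds)
overlap = float(np.mean([len(s & t) / 10 for s, t in zip(a, b)]))
print(f"mean 10-NN overlap between runs: {overlap:.2f}")
```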
108. A real example:
Word Vector Embeddings, 2 Corpora
Wikipedia (modern) vs EEBO (1470-1700)
GloVe (embedding algorithm)
300 dims (key parameter of algorithm)
5000 most common words (can’t show all words)
115. Another example:
Build topic models with abstracts?
Vispub data (871 papers)
Abstracts vs. Full Text
NMF, 10 topics
Hypothesis: Abstracts are different than papers
116.
117. Dimensionality Reduction of Topic Models
(abstracts vs. texts)
Red dots are the “topics”: one-hot vectors of each topic (unit dimensions in each vector)
121. Limitations
Binary comparison – less good for parameter tuning
need new designs?
Many complex views that need to be combined
address usability – through task-driven pre-arrangements?
Scalability of implementation and visual designs
Validation on real problems with real users
122. Summary
Embeddings for Text Analysis
If we can design interpretation tools
Compare document and word
embeddings to interpret them
Specialized tools for comparison
of embeddings
Task-centric design process
Comparison as Analysis Approach
Because we have a design process
An approach to thinking about
design for comparison
Examples using text analysis with
embedding comparison
4 considerations of comparison
123. Thanks!
To you for listening
To my students and collaborators
(too many to list)
To our sponsors
NSF, NIH, DARPA, Mellon Foundation
Michael Gleicher
http://pages.cs.wisc.edu/~gleicher
Editor's Notes
I presented the basic framework in my class in 2010, before phones had panoramic photography.
Danielle, who later built Sequence Surveyor, was there.
I mean comparison in its broad, standard dictionary sense: the act of relating two or more things.
To think about comparison in an abstract way, I have a set of four questions.
For each question, I have an abstraction of the space of answers, in the form of categories for the answers.
This slide is the main piece of the work. But it’s a little much to take in abstractly, so let me introduce it with an example.
The idea of Serendip is that it was a tool for viewing topic models and using them to gain insights on the text collections that they were built from.
It was a system that was built in collaboration with humanities scholars and had 3 main views.
But, you should be asking if you should try thinking this way.