This document provides an overview of word embeddings and the Word2Vec algorithm. It begins by establishing that measuring document similarity is an important natural language processing task, and that representing words as vectors is an effective approach. It then discusses different methods for representing words as vectors, including one-hot encoding and distributed representations. Word2Vec is introduced as a model that learns word embeddings by predicting words in a sentence based on context. The document demonstrates how Word2Vec can be used to find word similarities and analogies. It also references the theoretical justification that words with similar contexts have similar meanings.
3. Document Similarity
Hurricane Gilbert swept toward the Dominican Republic Sunday, and the Civil Defence alerted its heavily populated south coast to prepare for high winds, heavy rains and high seas. The storm was approaching from the southeast with sustained winds of 75 mph gusting to 92 mph. "There is no need for alarm," Civil Defence Director Eugenio Cabral said in a television alert shortly before midnight Saturday. Cabral said residents of the province of Barahona should closely follow Gilbert's movement. An estimated 100,000 people live in the province, including 70,000 in the city of Barahona, about 125 miles west of Santo Domingo. Tropical Storm Gilbert formed in the eastern Caribbean and strengthened into a hurricane Saturday night. The National Hurricane Centre in Miami reported its position at 2 a.m. Sunday at latitude 16.1 north, longitude 67.5 west, about 140 miles south of Ponce, Puerto Rico, and 200 miles southeast of Santo Domingo. The National Weather Service in San Juan, Puerto Rico, said Gilbert was moving westward at 15 mph with a "broad area of cloudiness and heavy weather" rotating around the centre of the storm. The weather service issued a flash flood watch for Puerto Rico and the Virgin Islands until at least 6 p.m. Sunday. Strong winds associated with Gilbert brought coastal flooding, strong southeast winds and up to 12 feet to Puerto Rico's south coast.
Ref: A text from DUC 2002 on "Hurricane Gilbert" (24 sentences)
4. How to measure similarity?
Term counts for the two documents:

Term        Document 1   Document 2
Gilbert     3            2
Hurricane   2            1
Rains       1            0
Storm       2            1
Winds       2            2

• To measure the similarity of documents, we need a single number.
• To understand document similarity, we need to understand sentences and words. How similar are my words? How are they related?

Sent1: The problem likely will mean corrective changes before the shuttle fleet starts flying again.
Sent2: The issue needs to be solved before the spacecraft fleet is cleared to shoot again.
Ref: Microsoft Research corpus
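As a concrete example of "some number", here is a minimal sketch (plain Python; the variable names are illustrative) that scores the two term-count vectors above with cosine similarity:

```python
import math

doc1 = {"Gilbert": 3, "Hurricane": 2, "Rains": 1, "Storm": 2, "Winds": 2}
doc2 = {"Gilbert": 2, "Hurricane": 1, "Rains": 0, "Storm": 1, "Winds": 2}

def cosine(a, b):
    # Dot product over the union of terms; absent terms count as 0.
    terms = set(a) | set(b)
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

print(round(cosine(doc1, doc2), 3))  # 0.944: the two excerpts share most terms
```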
5. Words can be similar or related in many ways…
• They mean the same thing (synonyms)
• They mean the opposite (antonyms)
• They are used in the same way (red, green)
• They are used in the same context (doctor, hospital, scalpel)
• They are of the same type (cat, dog -> mammal)
• They are inflected forms of the same word (swim, swimming)
6. How to measure word similarity?
• We need a number, preferably in the range (0, 1).
• We need to represent words in some numerical format as well.
• We need a word representation that computers can manipulate in a meaningful way.
• Scalar or vector? A vector is better, because it can capture similarity along multiple dimensions.
7. Representing words as vectors
• We cannot directly model how the brain understands word meaning (a neurophysiological limitation). Can we instead have a computational model that is consistent with usage?
• Let's represent words as vectors.
• We want to construct them so that similar words have similar vectors.
• Similarity-is-proximity: two similar things can be conceptualized as being near each other.
• Entities-are-locations: for two things to be close to each other, they need to have a spatial location.
8. One-Hot Encoding
bear = [1,0,0]  cat = [0,1,0]  frog = [0,0,1]
What is the similarity between bear and cat? The dot product of any two distinct one-hot vectors is 0.
Too many dimensions.
The data structure is sparse.
This is also called a discrete or local representation.
Ref: Constructing and Evaluating Word Embeddings - Marek Rei
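A minimal sketch of why this is limiting (numpy assumed): distinct one-hot vectors are always orthogonal, so no notion of word similarity can be recovered from them.

```python
import numpy as np

vocab = ["bear", "cat", "frog"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# The dot product (and hence the cosine) of any two distinct words is 0.
print(one_hot["bear"] @ one_hot["cat"])   # 0.0
print(one_hot["bear"] @ one_hot["frog"])  # 0.0
```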
9. Hot Problems with One-Hot…
• The dimensionality of the vectors scales with the size of the vocabulary.
• We must pre-determine the vocabulary size.
• It cannot scale to large or infinite vocabularies (Zipf's law!).
• The 'out-of-vocabulary' (OOV) problem: how would you handle unseen words in the test set?
• There is no relationship between words.
10. What if we decide on the dimensions?
bear = [0.9, 0.85]  cat = [0.85, 0.15]
What is the similarity between bear and cat now?
How many dimensions?
How do we know the right dimensions for our vocabulary?
This is known as a distributed representation.
Is it theoretically possible to come up with a limited set of features that exhaustively covers the meaning space?
Ref: Constructing and Evaluating Word Embeddings - Marek Rei
11. Measure of similarity
cos(bear, cat) = 0.15
cos(bear, lion) = 0.9
We can infer some information based only on the vector of the word.
We don't even need to know the labels on the vector elements.
Ref: Constructing and Evaluating Word Embeddings - Marek Rei
12. Vector Space
Creates an n-dimensional space. Each word is represented as a point in this space, i.e. by a vector with a fixed number of dimensions (generally 100-500).
Information about a particular feature is distributed among a set of (not necessarily mutually exclusive) dimensions.
All words that stand in some relation to each other will be near each other.
Word vectors in distributed form are:
• Dense
• Compressed (low dimension)
• Smooth (discrete to continuous)
Ref: Constructing and Evaluating Word Embeddings - Marek Rei
13. Story so far…
• Document similarity is one of the fundamental tasks of NLP.
• To infer document similarity, we need to express word similarity in numerical terms.
• Words are similar in many 'senses' (loosely defined).
• Mapping words into a common space (an embedding) looks like a possible way to capture the semantic relationships between words.
These vectors are often referred to as "word embeddings", as we are embedding the words into a real-valued low-dimensional space.
14. Obstacles?
• It is almost impossible to enumerate all possible 'meaning' dimensions by hand.
• For our setup, we cannot make assumptions about the corpus.
• When we don't know the dimensions explicitly, can we still learn the word vectors?
• We have large text collections, but very little labelled data.
Ref: http://www.changingmindsonline.com/wp-content/uploads/2014/12/roadblock.jpg
16. Neural nets for word vectors
• Neural networks automatically discover useful features in the data, given a specific task.
• Let's allocate a number of parameters for each word and allow the neural network to learn what the useful values should be.
• But neural nets are supervised. How do we obtain (feature, label) pairs?
17. Fill in the blanks
• I ________ at my desk [read/study/sit]
• A _______ climbing a tree [cat/bird/snake/man] [table/egg/car]
• ________ is the capital of ________.
Model each (context, word) pair as a (feature, label) pair and feed it to the neural network.
18. Any theoretical confirmation?
"You shall know a word by the company it keeps" - John Rupert Firth (1957)
In simple words: co-occurrence is a good indicator of meaning.
Even simpler: two words are considered close if they occur in similar contexts, where the context is the surrounding words.
(bug, insect) -> crawl, squash, small, wild, forest
(doctor, surgeon) -> operate, scalpel, medicine
This is also the idea that children can figure out how to use words they have rarely encountered before by generalizing about their use from distributions of similar words.
https://en.wikipedia.org/wiki/John_Rupert_Firth
19. Multiple contexts…
1. Can you cook some ________ for me?
2. _______ is so delicious.
3. _______ is not as healthy as fresh vegetables.
4. _______ was recently banned for some period of time.
5. _______, a Nestle brand, is very popular with kids.
Ref: https://www.ourncr.com/blog/wp-content/uploads/2016/06/top-maggi-points-in-delhi.jpg
20. Word2Vec
• Simple neural nets can be used to obtain distributed representations of words (Hinton et al., 1986; Elman, 1991).
• The resulting representations have interesting structure: vectors can be obtained using a shallow network (Mikolov, 2007).
• Two target words are close and semantically related if they have many common, strongly co-occurring words.
• Efficient Estimation of Word Representations in Vector Space (Mikolov, Chen, Corrado and Dean, 2013).
• A popular tool for creating word embeddings, available from Google: https://code.google.com/archive/p/word2vec/
21. Demo time
• Word similarity
• Word analogy
• The odd-man-out problem
• You can try it out online: http://bionlp-www.utu.fi/wv_demo/
• Found anything interesting?
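The same three tasks can also be reproduced locally. A sketch, assuming gensim 4.x and a locally downloaded copy of the pretrained Google News vectors (the file name is illustrative):

```python
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

# Word similarity
print(wv.similarity("hurricane", "storm"))

# Word analogy: king - man + woman ~ queen
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

# Odd man out
print(wv.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))
```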
22. Some words to try out!
Efficient Estimation of Word Representations in Vector Space (Mikolov, Chen, Corrado and Dean,2013)
23. Visualization of words in vector space
https://www.tensorflow.org/images/linear-relationships.png
24. About this talk
This is meant to be a short (2-3 hours), overview/summary-style talk on word2vec. I have tried to make it interesting with demos and real-world applications.
Full reference list at the end.
Comments/suggestions welcome: adwaitbhave@gmail.com
Slides and full code: https://github.com/yantraguru/word2vec_talk
25. Demystifying the algorithm
• How does a vector look?
• How does the underlying network look?
• What are the vector properties?
• What are the limitations on the learning?
• What pre-processing is required?
• How can we control or tune the vectors?
26. More on word2vec
word2vec is not a single algorithm. It is a software package containing:
Two distinct models
• CBoW
• Skip-gram
Various training methods
• Negative sampling
• Hierarchical softmax
A rich processing pipeline
• Dynamic context windows
• Subsampling
• Deleting rare words
Plus a bunch of tricks: weighting of distant words, down-sampling of frequent words…
28. Skip-gram model
Predict the surrounding words, based on the current word.
Training sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree".
What are the context words? In "the dog chased the cat", each word in turn is the input, and the words within the window around it are the context words to be predicted.
Efficient Estimation of Word Representations in Vector Space (Mikolov, Chen, Corrado and Dean, 2013)
29. Continuous Bag-of-Words (CBOW) model
Predict the current word, based on the surrounding words.
Training sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree".
What are the context words? In "the dog saw a cat", the words surrounding each target word form the context from which that target is predicted.
Efficient Estimation of Word Representations in Vector Space (Mikolov, Chen, Corrado and Dean, 2013)
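A minimal sketch of how the two models carve training pairs out of a sentence (window of 2; the function names are illustrative, not part of the word2vec package):

```python
def skipgram_pairs(tokens, window=2):
    # Skip-gram: (current word -> one context word) pairs.
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    # CBOW: (all context words -> current word) pairs.
    pairs = []
    for i, centre in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        pairs.append((context, centre))
    return pairs

tokens = "the dog chased the cat".split()
print(skipgram_pairs(tokens))  # ('the', 'dog'), ('the', 'chased'), ('dog', 'the'), ...
print(cbow_pairs(tokens))      # e.g. (['the', 'dog', 'the', 'cat'], 'chased')
```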
31. Word2vec network
• Let's assume we have 3 sentences: "the dog saw a cat", "the dog chased the cat", "the cat climbed a tree".
• That gives 8 distinct words.
• Input and output are one-hot encoded.
• The weight matrices are learnt by backpropagation.
Efficient Estimation of Word Representations in Vector Space (Mikolov, Chen, Corrado and Dean, 2013)
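A sketch of that network's shapes (numpy assumed; the 3 embedding dimensions are chosen arbitrarily). Backpropagation would adjust W_in and W_out; after training, the rows of W_in are the word vectors.

```python
import numpy as np

vocab_size, dim = 8, 3                         # 8 words, 3-d embeddings
rng = np.random.default_rng(0)
W_in = rng.normal(size=(vocab_size, dim))      # input-to-hidden weights
W_out = rng.normal(size=(dim, vocab_size))     # hidden-to-output weights

x = np.zeros(vocab_size)
x[2] = 1.0                                     # one-hot input for word #2

h = x @ W_in                                   # hidden layer = row 2 of W_in
scores = h @ W_out                             # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the vocabulary
print(probs.round(3))                          # predicted context-word distribution
```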
32. Dynamic Context Windows
Marco saw a furry little dog hiding in the tree.
word2vec samples the actual window size uniformly between 1 and the maximum, so a context word at distance d from the target is included with probability (window - d + 1) / window. For the target "dog" with a maximum window of 4:
saw 1/4, a 2/4, furry 3/4, little 4/4, [dog], hiding 4/4, in 3/4, the 2/4, tree 1/4
These are the probabilities that each specific context word will be included in the training data.
33. Negative Sampling
• Instead of updating the weights for every word in the vocabulary, we randomly select just a small number of "negative" words (say 5) to update the weights for. (In this context, a "negative" word is one for which we want the network to output a 0.)
• Essentially, the probability of selecting a word as a negative sample is related to its frequency, with more frequent words being more likely to be selected as negative samples.
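The released word2vec code draws negatives from the unigram distribution raised to the 3/4 power, which boosts rare words relative to their raw frequency. A sketch (numpy assumed; the counts are invented):

```python
import numpy as np

counts = {"the": 1000, "cat": 50, "scalpel": 2}
words = list(counts)
freqs = np.array([counts[w] for w in words], dtype=float)

weights = freqs ** 0.75                    # the 3/4 power used by word2vec
probs = weights / weights.sum()
print(dict(zip(words, probs.round(3))))    # "the" is damped, "scalpel" boosted

rng = np.random.default_rng(0)
print(rng.choice(words, size=5, p=probs))  # 5 negative samples
```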
34. Subsampling and Deleting Rare Words
• There are two "problems" with common words like "the": they co-occur with almost everything, and we have far more training samples of them than we need. So, for each word we encounter in the training text, there is a chance that we will effectively delete it from the text. The probability that we cut the word is related to the word's frequency.
• Separately, ignore words that are rare in the training corpus: remove these tokens from the corpus before creating context windows.
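Mikolov et al. (2013) discard each occurrence of a word w with probability 1 - sqrt(t / f(w)), where f(w) is the word's corpus frequency and t is a threshold around 1e-5. A sketch (the frequencies are invented):

```python
import math
import random

def keep_probability(freq, t=1e-5):
    # Keep an occurrence with probability sqrt(t / f(w)), capped at 1.
    return min(1.0, math.sqrt(t / freq))

print(keep_probability(0.05))  # "the" at 5% of tokens: kept ~1.4% of the time
print(keep_probability(1e-6))  # a rare word: always kept

freq = {"the": 0.05, "dog": 1e-4, "saw": 1e-4, "cat": 1e-4}
tokens = ["the", "dog", "saw", "the", "cat"]
kept = [w for w in tokens if random.random() < keep_probability(freq[w])]
print(kept)
```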
35. Word2vec Parameters
• skip-gram or CBOW
• window size (m)
• number of dimensions
• number of epochs
• negative sampling or hierarchical softmax
• number of negative samples (k)
• sampling distribution for negative samples (s)
• subsampling frequency
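These knobs map directly onto, for example, gensim's implementation (parameter names as in gensim 4.x; the values are illustrative):

```python
from gensim.models import Word2Vec

sentences = [["the", "dog", "saw", "a", "cat"],
             ["the", "dog", "chased", "the", "cat"],
             ["the", "cat", "climbed", "a", "tree"]]

model = Word2Vec(
    sentences,
    sg=1,              # 1 = skip-gram, 0 = CBOW
    window=5,          # window size (m)
    vector_size=100,   # number of dimensions
    epochs=5,          # number of epochs
    hs=0,              # 0 = negative sampling, 1 = hierarchical softmax
    negative=5,        # negative samples (k)
    ns_exponent=0.75,  # sampling distribution for negative samples (s)
    sample=1e-5,       # subsampling frequency
    min_count=1,       # keep every word in this toy corpus
)
print(model.wv.most_similar("dog", topn=3))
```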
36. Demo and Code walkthrough
• Create word embeddings from corpus
• Examine the word similarities and other tasks
37. Word2Vec Key Ideas
• Achieve better performance not by using a more complex model (i.e. one with more layers), but by allowing a simpler (shallower) model to be trained on much larger amounts of data.
• The meaning of a new word can be acquired just through reading (Miller and Charles, 1991).
• Use a neural network whose hidden layer acts as a feature detector.
• Simple objective.
• Few linguistic assumptions.
• The implementation works without building or storing the actual co-occurrence matrix in memory.
• It is very fast to train and can use multiple threads.
• It can easily scale to huge data and very large word and context vocabularies.
39. Word2Vec Unanswered Questions
• How do you generate vectors for unknown words (the out-of-vocabulary problem)?
• How do you generate vectors for infrequent words?
• Non-uniform results.
• Hard to understand and visualize, as the dimensions are learned rather than hand-labelled.
41. Competition and State of the art…
• Pennington, Socher, and Manning (2014): GloVe: Global Vectors for Word Representation
• Word embeddings can be composed from characters:
  • Generate embeddings for unknown words
  • Similar spellings share similar embeddings
• Kim, Jernite, Sontag, and Rush (2015): Character-Aware Neural Language Models
• Dos Santos and Zadrozny (2014): Learning Character-level Representations for Part-of-Speech Tagging
42. More about word2vec
• Word embeddings are one of the most exciting areas of research in deep learning.
• They provide a fresh perspective on ALL problems in NLP, rather than solving just one problem.
• Much faster and far more accurate than previous neural-net-based solutions: training sped up from weeks to seconds compared to the prior state of the art.
• Features derived from word2vec are used across all big IT companies in plenty of applications.
• Also very popular in the research community: a simple way to boost performance in many NLP tasks.
• Main reasons for its success: very fast, open source, and the resulting features are easy to use to boost many applications (even non-NLP ones).
• word2vec is successful because it is simple, but it cannot be applied everywhere.
43. Pre-trained Vectors
• word2vec is often used for pretraining: it helps your models start from an informed position, and it requires only plain text, of which we have a lot.
• Pretrained vectors are also available (trained on ~100 billion words).
• However, for best performance it is important to continue training (fine-tuning).
• Raw word2vec vectors are good for predicting the surrounding words, but not necessarily for your specific task.
• Simply treat the embeddings the same as other parameters in your model and keep updating them during training.
• Google News dataset (~100 billion words).
• A common web platform with multiple datasets: http://www.wordvectors.org
44. Back to the original problem…
• How do we find a document vector?
• Sum the word vectors?
• BOW (bag of words)?
• Clustering
• Recurrent networks
• Convolutional networks
• Tree-structured networks
• paragraph2vec (2015)
45. Common-sense approach
• Demo 1: Stack Overflow question clustering using word2vec
• Demo 2: Document clustering using pretrained word vectors
• Code walkthrough
46. Outline of the approach
Text pre-processing:
• Replace newlines/tabs/multiple spaces
• Remove punctuation (brackets, comma, semicolon, slash)
• Sentence tokenizer [spaCy]
• Join back sentences separated by newlines
Pre-processing outputs a text document containing one sentence per line. This structure ignores paragraph placement.
Load pre-trained word vectors:
• Train on the corpus itself if it is sufficiently large (~5,000+ documents or 5,000,000+ tokens)
• Or use an open-source model such as the Google News word2vec vectors
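A sketch of that pre-processing pipeline (spaCy, as named on the slide; the small English model and the exact punctuation set are assumptions):

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(raw_text):
    # Replace newlines, tabs and runs of spaces with a single space.
    text = re.sub(r"\s+", " ", raw_text)
    # Remove punctuation: brackets, comma, semicolon, slash.
    text = re.sub(r"[()\[\]{},;/]", " ", text)
    # Sentence-tokenize, then emit one sentence per line.
    doc = nlp(text)
    return "\n".join(sent.text.strip() for sent in doc.sents)

print(preprocess("The dog saw a cat.\nThe cat climbed a tree."))
```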
47. Outline of the approach
Document vector calculation:
• Tokenize sentences into terms.
• Filter terms:
  • Remove terms occurring in fewer than <20> documents
  • Remove terms occurring in more than <85%> of the documents
• Calculate counts of all terms.
For each word (term) of the document, check that it: [1]
• is not a stop word,
• has an associated vector,
• has survived the count-based filtering.
• If the term satisfies criteria [1], get the vector for the term from the pretrained model. [2]
• Calculate the weight of the term as the log of the term's count in the document. [3]
• Weighted vector = weight × vector.
• Finally, document vector = sum of all weighted term vectors / number of terms.
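A sketch of that calculation (gensim-style keyed vectors assumed, i.e. any mapping from word to numpy vector works; log(1 + count) is used so that single-occurrence terms keep a non-zero weight, which is an assumption about the slide's step [3]):

```python
import math
from collections import Counter

def document_vector(tokens, wv, stopwords, vocabulary):
    # [1] Keep terms that are not stop words, have a vector,
    #     and survived the document-frequency filtering above.
    counts = Counter(tokens)
    total, vec = 0, None
    for term, count in counts.items():
        if term in stopwords or term not in vocabulary or term not in wv:
            continue
        weight = math.log(1 + count)   # [3] log-scaled in-document count
        weighted = weight * wv[term]   # [2] vector from the pretrained model
        vec = weighted if vec is None else vec + weighted
        total += 1
    return None if total == 0 else vec / total  # average of weighted vectors
```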
48. Document Clusters
• Document similarity: cosine distance of document vectors
• Document search: query vector vs. document vectors
• Document browsing: using semantic clusters
• Topic detection: a cluster signifies a topic
• Anomalous document detection: based on inter-cluster distance
• Classification: supervised training based on document vectors
49. References
• Mikolov (2012): Statistical Language Models Based on Neural Networks
• Mikolov, Yih, Zweig (2013): Linguistic Regularities in Continuous Space Word Representations
• Mikolov, Chen, Corrado, Dean (2013): Efficient Estimation of Word Representations in Vector Space
• Mikolov, Sutskever, Chen, Corrado, Dean (2013): Distributed Representations of Words and Phrases and their Compositionality
• Baroni, Dinu, Kruszewski (2014): Don't Count, Predict! A Systematic Comparison of Context-Counting vs. Context-Predicting Semantic Vectors
• Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word Representation
• Levy, Goldberg, Dagan (2015): Improving Distributional Similarity with Lessons Learned from Word Embeddings
Possible measures of similarity might take into consideration:
(a) The lengths of the documents
(b) The number of terms in common
(c) Whether the terms are common or unusual
(d) How many times each term appears
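Points (a)-(d) are essentially what TF-IDF with cosine normalization captures. A minimal sketch (scikit-learn assumed):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["Hurricane Gilbert swept toward the Dominican Republic.",
        "Tropical Storm Gilbert strengthened into a hurricane."]

# (c)/(d): TF-IDF weights each term by its in-document frequency and by how
# unusual it is across documents; (a): vectors are L2-normalized, which
# compensates for document length; (b): shared terms drive the dot product.
tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf[0], tfidf[1]))
```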
Each layer can apply any function you want to the previous layer to produce an output (usually a linear transformation followed by a squashing nonlinearity).
The hidden layer's job is to transform the inputs into something that the output layer can use.
The output layer transforms the hidden layer activations into whatever scale you wanted your output to be on.
Firth's Distributional Hypothesis is the basis for statistical semantics. Although the Distributional Hypothesis originated in linguistics, it is now receiving attention in cognitive science, especially regarding the context of word use. In recent years, the Distributional Hypothesis has provided the basis for the theory of similarity-based generalization in language learning.