Continuous representations of words and documents, recently referred to as word embeddings, have driven large advances in many natural language processing tasks.
In this presentation we will provide an introduction to the most common methods of learning these representations, as well as to earlier methods of building them before the recent advances in deep learning, such as dimensionality reduction on the word co-occurrence matrix.
Moreover, we will present the continuous bag of words model (CBOW), one of the most successful models for word embeddings and one of the core models in word2vec, and briefly glance at other models that build representations for other tasks, such as knowledge base embeddings.
Finally, we will motivate the potential of using such embeddings for many tasks that could be of importance for the group, such as semantic similarity, document clustering and retrieval.
A Simple Introduction to Word Embeddings, by Bhaskar Mitra
In information retrieval there is a long history of learning vector representations for words. In recent times, neural word embeddings have gained significant popularity for many natural language processing tasks, such as word analogy and machine translation. The goal of this talk is to introduce the basic intuitions behind these simple but elegant models of text representation. We will start our discussion with classic vector space models and then make our way to recently proposed neural word embeddings. We will see how these models can be useful for analogical reasoning as well as applied to many information retrieval tasks.
An introduction to the Transformers architecture and BERT, by Suman Debnath
The transformer is one of the most popular state-of-the-art (SOTA) deep learning architectures, mostly used for natural language processing (NLP) tasks. Ever since its advent, the transformer has replaced RNNs and LSTMs for various tasks. It also created a major breakthrough in the field of NLP and paved the way for new revolutionary architectures such as BERT.
The Transformer is an established architecture in natural language processing that combines a self-attention framework with a deep learning approach.
This presentation was delivered under the mentorship of Mr. Mukunthan Tharmakulasingam (University of Surrey, UK), as a part of the ScholarX program from Sustainable Education Foundation.
Word embedding, Vector space model, Language modelling, Neural language model, Word2Vec, GloVe, fastText, ELMo, BERT, DistilBERT, RoBERTa, SBERT, Transformer, Attention
A simple explanation of the paper XLNet (https://arxiv.org/abs/1906.08237).
It would be helpful to get to grips with the concepts behind XLNet before you dive into the paper.
GPT-2: Language Models are Unsupervised Multitask Learners, by Young Seok Kim
Review of paper
Language Models are Unsupervised Multitask Learners
(GPT-2)
by Alec Radford et al.
Paper link: https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
YouTube presentation: https://youtu.be/f5zULULWUwM
(Slides are written in English, but the presentation is done in Korean)
Abstractive text summarization is nowadays one of the most important research topics in NLP. However, getting a deep understanding of what it is and how it works requires a series of base pieces of knowledge that build on top of each other. This is why this presentation gives the audience an overview of sequence-to-sequence models and the various versions of attention that have accelerated them over the past few years. In addition, natural language generation (NLG), with a focus on decoder techniques and their relevant problems, will be reviewed as a supporting factor in the success of automatic summarization. Finally, abstractive text summarization will be presented together with potential approaches to some hot issues in the latest research papers.
Financial Question Answering with BERT Language Models, by Bithiah Yuan
FinBERT-QA is a Question Answering system for retrieving opinionated financial passages from task 2 of the FiQA dataset. The system uses techniques from information retrieval, natural language processing, and deep learning.
Introduction to seq2seq (sequence to sequence) and RNN, by Hye-min Ahn
These are my slides introducing the sequence to sequence model and Recurrent Neural Networks (RNN) to my laboratory colleagues.
Hyemin Ahn, @CPSLAB, Seoul National University (SNU)
BERT: Bidirectional Encoder Representations from Transformers.
BERT is a pretrained model by Google for state-of-the-art NLP tasks.
BERT has the ability to take into account the syntactic and semantic meaning of text.
This is material I prepared for a lab study session about the "Transformer", which underlies recent NLP x Deep Learning research. I have tried to be accurate, including citations of the reference material, but please point out any errors.
Presented by Ted Xiao at RobotXSpace on 4/18/2017. This workshop covers the fundamentals of Natural Language Processing, crucial NLP approaches, and an overview of NLP in industry.
Word embeddings have received a lot of attention since Tomas Mikolov published word2vec in 2013 and showed that the embeddings a neural network learned by “reading” a large corpus of text preserved semantic relations between words. As a result, this type of embedding started being studied in more detail and applied to more serious NLP and IR tasks such as summarization, query expansion, etc. More recently, researchers and practitioners alike have come to appreciate the power of this type of approach and have started a cottage industry of adapting Mikolov’s original approach to many different areas.
In this talk we will cover the implementation and mathematical details underlying tools like word2vec and some of the applications word embeddings have found in various areas. Starting from an intuitive overview of the main concepts and algorithms underlying the neural network architecture used in word2vec, we will proceed to discuss the implementation details of the word2vec reference implementation in TensorFlow. Finally, we will provide a bird's-eye view of the emerging field of “2vec” methods (dna2vec, node2vec, etc.) that use variations of the word2vec neural network architecture.
This (short) version of the Tutorial was presented at #AIWTB https://ai.withthebest.com/. See https://bmtgoncalves.github.io/word2vec-and-friends/ for further details on future (and longer) editions and sign up to http://tinyletter.com/dataforscience for related news and updates.
The development of the Web and of social networks, and the mass digitization of documents, are contributing to a renewal of the humanities and social sciences, of the study of literary and cultural heritage, and of the way the scientific literature in general is exploited.
The digital humanities, which cross various disciplines with computer science, place at their core the questions of data volume, diversity, origin, veracity and representativeness. Information is conveyed within textual "documents" (books, Web pages, tweets...), audio, video or multimedia, which may include illustrations or graphics.
Handling such resources requires the development of robust computational approaches, able to scale and suited to the fundamentally ambiguous and varied nature of the information being manipulated (natural language or images to interpret, multiple points of view...).
While statistical learning approaches are commonplace for classification or information extraction tasks, they must cope with sparse vector spaces of very high dimension (several million), be able to exploit resources (for example lexicons or thesauri), and take into account or produce semantic annotations that can be reused.
To face these challenges, infrastructures have been created such as HumaNum at the national level and DARIAH or CLARIN at the European level, and recommendations have been established at the global level such as the TEI (Text Encoding Initiative). Platforms serving scientific information, such as the "equipment of excellence" OpenEdition.org, are another essential building block for the preservation of and access to "Big Digital Humanities", and also for fostering the reproducibility and understanding of experiments and of the results obtained.
Visual-Semantic Embeddings: some thoughts on Language, by Roelof Pieters
Language technology is rapidly evolving. A resurgence in the use of distributed semantic representations and word embeddings, combined with the rise of deep neural networks has led to new approaches and new state of the art results in many natural language processing tasks. One such exciting - and most recent - trend can be seen in multimodal approaches fusing techniques and models of natural language processing (NLP) with that of computer vision.
The talk is aimed at giving an overview of the NLP part of this trend. It will start with a short overview of the challenges in creating deep networks for language, of what makes for a "good" language model, and of the specific requirements of semantic word spaces for multi-modal embeddings.
At Return Path, we used a deep learning-inspired machine-learning algorithm called word2vec and the data in our Consumer Data Stream to find interesting relationships between email senders.
https://telecombcn-dl.github.io/2017-dlsl/
Winter School on Deep Learning for Speech and Language. UPC BarcelonaTech ETSETB TelecomBCN.
The aim of this course is to train students in methods of deep learning for speech and language. Recurrent Neural Networks (RNN) will be presented and analyzed in detail to understand the potential of these state of the art tools for time series processing. Engineering tips and scalability issues will be addressed to solve tasks such as machine translation, speech recognition, speech synthesis or question answering. Hands-on sessions will provide development skills so that attendees can become competent in contemporary data analytics tools.
Microsoft PROSE SDK: A Framework for Inductive Program Synthesis, by Alex Polozov
Presented at SPLASH (OOPSLA) 2015.
Inductive synthesis, or programming-by-examples (PBE) is gaining prominence with disruptive applications for automating repetitive tasks in end-user programming. However, designing, developing, and maintaining an effective industrial-quality inductive synthesizer is an intellectual and engineering challenge, requiring 1-2 man-years of effort.
Our novel observation is that many PBE algorithms are a natural fall-out of one generic meta-algorithm and the domain-specific properties of the operators in the underlying domain-specific language (DSL). The meta-algorithm propagates example-based constraints on an expression to its subexpressions by leveraging associated witness functions, which essentially capture the inverse semantics of the underlying operator. This observation enables a novel program synthesis methodology called _data-driven domain-specific deduction_ (D⁴), where domain-specific insight, provided by the DSL designer, is separated from the synthesis algorithm.
Our **FlashMeta** framework implements this methodology, allowing synthesizer developers to generate an efficient synthesizer from the mere DSL definition (if properties of the DSL operators have been modeled). In our case studies, we found that 10+ existing industrial-quality mass-market applications based on PBE can be cast as instances of D⁴. Our evaluation includes reimplementation of some prior works, which in FlashMeta become more efficient, maintainable, and extensible. As a result, FlashMeta-based PBE tools are deployed in several industrial products, including Microsoft PowerShell 3.0 for Windows 10, Azure Operational Management Suite, and Microsoft Cortana digital assistant.
Cancer cell metabolism: special reference to the lactate pathway, by AADYARAJPANDEY1
Normal Cell Metabolism:
Cellular respiration describes the series of steps that cells use to break down sugar and other chemicals to get the energy we need to function.
Energy is stored in the bonds of glucose and when glucose is broken down, much of that energy is released.
Cells utilize energy in the form of ATP.
The first step of respiration is called glycolysis. In a series of steps, glycolysis breaks glucose into two smaller molecules - a chemical called pyruvate. A small amount of ATP is formed during this process.
Most healthy cells continue the breakdown in a second process, called the Krebs cycle. The Krebs cycle allows cells to “burn” the pyruvate made in glycolysis to get more ATP.
The last step in the breakdown of glucose is called oxidative phosphorylation (Ox-Phos).
It takes place in specialized cell structures called mitochondria. This process produces a large amount of ATP. Importantly, cells need oxygen to complete oxidative phosphorylation.
If a cell completes only glycolysis, only 2 molecules of ATP are made per glucose. However, if the cell completes the entire respiration process (glycolysis, Krebs cycle, oxidative phosphorylation), about 36 molecules of ATP are created, giving it much more energy to use.
IN CANCER CELL:
Unlike healthy cells that "burn" the entire molecule of sugar to capture a large amount of energy as ATP, cancer cells are wasteful.
Cancer cells only partially break down sugar molecules. They overuse the first step of respiration, glycolysis. They frequently do not complete the second step, oxidative phosphorylation.
This results in only 2 molecules of ATP per each glucose molecule instead of the 36 or so ATPs healthy cells gain. As a result, cancer cells need to use a lot more sugar molecules to get enough energy to survive.
Introduction to the WARBURG PHENOMENON:
WARBURG EFFECT: Usually, cancer cells are highly glycolytic (glucose addiction) and take up more glucose from outside than normal cells do.
Otto Heinrich Warburg (8 October 1883 – 1 August 1970) was awarded the Nobel Prize in Physiology or Medicine in 1931 for his "discovery of the nature and mode of action of the respiratory enzyme".
WARBURG EFFECT: The tendency of cancer cells under aerobic (well-oxygenated) conditions to metabolize glucose to lactate (aerobic glycolysis) is known as the Warburg effect. Warburg made the observation that tumor slices consume glucose and secrete lactate at a higher rate than normal tissues.
Seminar on U.V. Spectroscopy, by SAMIR PANDA
Spectroscopy is a branch of science dealing with the study of the interaction of electromagnetic radiation with matter.
Ultraviolet-visible spectroscopy refers to absorption spectroscopy or reflectance spectroscopy in the UV-VIS spectral region.
Ultraviolet-visible spectroscopy is an analytical method that can measure the amount of light absorbed by the analyte.
This pdf is about schizophrenia.
For more details, visit the YouTube channel @SELF-EXPLANATORY:
https://www.youtube.com/channel/UCAiarMZDNhe1A3Rnpr_WkzA/videos
Thanks...!
Multi-source connectivity as the driver of solar wind variability in the heli..., by Sérgio Sacani
The ambient solar wind that fills the heliosphere originates from multiple sources in the solar corona and is highly structured. It is often described as high-speed, relatively homogeneous plasma streams from coronal holes and slow-speed, highly variable streams whose source regions are under debate. A key goal of ESA/NASA's Solar Orbiter mission is to identify solar wind sources and understand what drives the complexity seen in the heliosphere. By combining magnetic field modelling and spectroscopic techniques with high-resolution observations and measurements, we show that the solar wind variability detected in situ by Solar Orbiter in March 2022 is driven by spatio-temporal changes in the magnetic connectivity to multiple sources in the solar atmosphere. The magnetic field footpoints connected to the spacecraft moved from the boundaries of a coronal hole to one active region (12961) and then across to another region (12957). This is reflected in the in situ measurements, which show the transition from fast to highly Alfvénic and then to slow solar wind that is disrupted by the arrival of a coronal mass ejection. Our results describe solar wind variability at 0.5 au but are applicable to near-Earth observatories.
Introduction:
RNA interference (RNAi) or Post-Transcriptional Gene Silencing (PTGS) is an important biological process for modulating eukaryotic gene expression.
It is a highly conserved process of post-transcriptional gene silencing in which double-stranded RNA (dsRNA) causes sequence-specific degradation of mRNA sequences.
dsRNA-induced gene silencing (RNAi) has been reported in a wide range of eukaryotes, from worms and insects to mammals and plants.
This process mediates resistance to both endogenous parasitic and exogenous pathogenic nucleic acids, and regulates the expression of protein-coding genes.
What are small ncRNAs?
micro RNA (miRNA)
short interfering RNA (siRNA)
Properties of small non-coding RNA:
Involved in silencing mRNA transcripts.
Called “small” because they are usually only about 21-24 nucleotides long.
Synthesized by first cutting up longer precursor sequences (like the 61nt one that Lee discovered).
Silence an mRNA by base pairing with some sequence on the mRNA.
Discovery of siRNA?
The first small RNA:
In 1993, Rosalind Lee (Victor Ambros lab) was studying a non-coding gene in C. elegans, lin-4, that was involved in silencing another gene, lin-14, at the appropriate time in the development of the worm.
Two small transcripts of lin-4 (22 nt and 61 nt) were found to be complementary to a sequence in the 3' UTR of lin-14.
Because lin-4 encoded no protein, she deduced that it must be these transcripts that were causing the silencing, through RNA-RNA interactions.
Types of RNAi (non-coding RNA):
miRNA: length 23-25 nt; trans-acting; binds its target mRNA with mismatches; causes translation inhibition.
siRNA: length ~21 nt; cis-acting; binds its target mRNA through a perfectly complementary sequence.
piRNA (Piwi-interacting RNA): length 25-36 nt; expressed in germ cells; regulates transposon activity.
MECHANISM OF RNAI:
First the double-stranded RNA teams up with a protein complex named Dicer, which cuts the long RNA into short pieces.
Then another protein complex called RISC (RNA-induced silencing complex) discards one of the two RNA strands.
The RISC-docked, single-stranded RNA then pairs with the homologous mRNA and destroys it.
THE RISC COMPLEX:
RISC is a large (>500 kDa) RNA-binding multi-protein complex which triggers degradation of the target mRNA.
The double-stranded siRNA is unwound by an ATP-independent helicase.
The active component of RISC is the Argonaute (Ago) protein, an endonuclease which cleaves the target mRNA.
DICER: an endonuclease (RNase III family).
Argonaute: the central component of the RNA-Induced Silencing Complex (RISC).
One strand of the dsRNA produced by Dicer is retained in the RISC complex in association with Argonaute.
ARGONAUTE PROTEIN domains:
1. PAZ (PIWI/Argonaute/Zwille): recognition of the target mRNA.
2. PIWI (P-element induced wimpy testis): breaks the phosphodiester bond of the mRNA (RNase H-like activity).
miRNA:
Double-stranded RNAs are naturally produced in eukaryotic cells during development, and they have a key role in regulating gene expression.
Richard's adventures in two entangled wonderlands, by Richard Gill
Since the loophole-free Bell experiments of 2020 and the Nobel prizes in physics of 2022, critics of Bell's work have retreated to the fortress of super-determinism. Now, super-determinism is a derogatory word - it just means "determinism". Palmer, Hance and Hossenfelder argue that quantum mechanics and determinism are not incompatible, using a sophisticated mathematical construction based on a subtle thinning of allowed states and measurements in quantum mechanics, such that what is left appears to make Bell's argument fail, without altering the empirical predictions of quantum mechanics. I think however that it is a smoke screen, and the slogan "lost in math" comes to my mind. I will discuss some other recent disproofs of Bell's theorem using the language of causality based on causal graphs. Causal thinking is also central to law and justice. I will mention surprising connections to my work on serial killer nurse cases, in particular the Dutch case of Lucia de Berk and the current UK case of Lucy Letby.
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a..., by Ana Luísa Pinho
Functional Magnetic Resonance Imaging (fMRI) provides means to characterize brain activations in response to behavior. However, cognitive neuroscience has been limited to group-level effects referring to the performance of specific tasks. To obtain the functional profile of elementary cognitive mechanisms, the combination of brain responses to many tasks is required. Yet, to date, both structural atlases and parcellation-based activations do not fully account for cognitive function and still present several limitations. Further, they do not adapt overall to individual characteristics. In this talk, I will give an account of deep-behavioral phenotyping strategies, namely data-driven methods in large task-fMRI datasets, to optimize functional brain-data collection and improve inference of effects-of-interest related to mental processes. Key to this approach is the employment of fast multi-functional paradigms rich on features that can be well parametrized and, consequently, facilitate the creation of psycho-physiological constructs to be modelled with imaging data. Particular emphasis will be given to music stimuli when studying high-order cognitive mechanisms, due to their ecological nature and quality to enable complex behavior compounded by discrete entities. I will also discuss how deep-behavioral phenotyping and individualized models applied to neuroimaging data can better account for the subject-specific organization of domain-general cognitive systems in the human brain. Finally, the accumulation of functional brain signatures brings the possibility to clarify relationships among tasks and create a univocal link between brain systems and mental functions through: (1) the development of ontologies proposing an organization of cognitive processes; and (2) brain-network taxonomies describing functional specialization. To this end, tools to improve commensurability in cognitive science are necessary, such as public repositories, ontology-based platforms and automated meta-analysis tools. I will thus discuss some brain-atlasing resources currently under development, and their applicability in cognitive as well as clinical neuroscience.
This presentation gives a brief overview of the structural and functional attributes of nucleotides and of the structure and function of genetic materials, along with the impact of UV rays and pH upon them.
1. Word Embeddings: Why the Hype?
Hady Elsahar
Hady.elsahar@univ-st-etienne.fr
slides available at :
2. Introduction
● Why vectors for natural language?
● Conventional representations for words and documents
● Methods of dimensionality reduction
Deep learning models:
● Continuous Bag of Words model
● Other models (Skip-Gram model, GloVe)
● Evaluation of word vectors
● Readings and references
3. Introduction: Why Vectors
Document classification or clustering:
● Documents are composed of words
● Similar documents will contain similar words
● Machine learning algorithms love vectors
● A machine learning algorithm should learn which words are significant for which category
4. Bag of Words Model
“Represent each document by the bag of words it contains”
d1 : Mary loves Movies, Cinema and Art Class 1 : Arts
d2 : John went to the Football game Class 2 : Sports
d3 : Robert went for the Movie Delicatessen Class : Arts
Term-document matrix: the columns are the vocabulary words (Mary, Loves, Movies, Cinema, Art, John, Went, to, the, Delicatessen, Robert, Football, Game, and, for) and each document row has a 1 in the column of every word the document contains, 0 elsewhere.
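To make the idea concrete, here is a minimal binary bag-of-words sketch in plain Python (not from the slides); it builds the vocabulary from the three example documents and maps each one to a 0/1 vector.

```python
docs = {
    "d1": "Mary loves Movies Cinema and Art",
    "d2": "John went to the Football game",
    "d3": "Robert went for the Movie Delicatessen",
}

# Vocabulary over all documents (lowercased).
vocab = sorted({w.lower() for text in docs.values() for w in text.split()})

def bag_of_words(text):
    """Binary bag-of-words vector: 1 if the word occurs in the document, 0 otherwise."""
    words = {w.lower() for w in text.split()}
    return [1 if v in words else 0 for v in vocab]

for name, text in docs.items():
    print(name, bag_of_words(text))
```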
5. Bag of Words Model
Can a machine learning algorithm know that “the” and “for” are unimportant words?
● Yes, but it will need lots of labeled training data
What to do?
● Use hand-crafted features (weighting features for words)
● Make lots of them
● Keep doing this for 50 years
● Regret later .. cry hard
6. Bag of Words Model + Weighting Features
Weighting features, example: TF-IDF
● TF-IDF ~= term frequency / document frequency
● Motivation: words appearing in a large number of documents are not significant
TF-IDF weighted term-document matrix over the same vocabulary: content words such as Mary, Loves, Movies, Cinema, Art get weights around 0.38-0.46, while words that appear in many documents, such as “the”, “to”, “and”, “for”, get weights close to zero.
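A minimal sketch of this weighting scheme, using the common tf × log(N/df) formulation rather than the exact numbers on the slide:

```python
import math
from collections import Counter

docs = [
    "Mary loves Movies Cinema and Art".lower().split(),
    "John went to the Football game".lower().split(),
    "Robert went for the Movie Delicatessen".lower().split(),
]

vocab = sorted({w for doc in docs for w in doc})
# Document frequency: in how many documents does each word appear?
df = {w: sum(w in doc for doc in docs) for w in vocab}

def tfidf(doc):
    """TF-IDF weights for one document: term frequency * log(N / document frequency)."""
    counts = Counter(doc)
    n_docs = len(docs)
    return {w: counts[w] * math.log(n_docs / df[w]) for w in counts}

# Words that appear in many documents (e.g. "the", "went") get low weights.
print(tfidf(docs[1]))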
7. Word Vector Representations
Documents can be represented by words, but how do we represent the words themselves?
“You shall know a word by the company it keeps” (J.R. Firth)
8. Word Vector Representations
Use a sliding window over a big corpus of text and count word co-occurrences within it.
1. I enjoy flying.
2. I like NLP.
3. I like deep learning.
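A minimal sketch of building such a window-based co-occurrence matrix over the three example sentences (symmetric window of size 1, purely illustrative):

```python
import numpy as np

sentences = [
    ["i", "enjoy", "flying"],
    ["i", "like", "nlp"],
    ["i", "like", "deep", "learning"],
]

vocab = sorted({w for s in sentences for w in s})
index = {w: i for i, w in enumerate(vocab)}
window = 1

cooc = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in sentences:
    for pos, word in enumerate(sent):
        # Count every word within `window` positions of the current word.
        for ctx in sent[max(0, pos - window):pos] + sent[pos + 1:pos + 1 + window]:
            cooc[index[word], index[ctx]] += 1

print(vocab)
print(cooc)
```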
9. Bag of words Representations: Drawbacks
● High dimensionality and very sparse!
● Unable to capture word order
○ “good but expensive” and “expensive but good” will have the same representation.
● Unable to capture semantic similarities (mostly because of sparsity)
○ “boy”, “girl” and “car”
○ “Human”, “Person” and “Giraffe”
10. Bag of words Representations: Drawbacks
How to overcome this?
● Keep using hand crafted features
● Make lots of them
● Keep doing this for 50 years
● Regret later .. cry hard
Or … Dimensionality reduction
12. Singular value decomposition
● Lower dimensionality K << |V|
● Take the most significant projections of your vector space
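A sketch of this reduction step with a truncated SVD in numpy, applied to a word co-occurrence matrix like the one built above (the random matrix here is only a placeholder); keeping the top k singular directions gives k-dimensional dense word vectors.

```python
import numpy as np

# cooc: |V| x |V| word co-occurrence matrix (e.g. built with the sliding-window sketch).
cooc = np.random.rand(8, 8)  # placeholder matrix for illustration
k = 2                        # target dimensionality, k << |V|

U, S, Vt = np.linalg.svd(cooc)
# Keep only the k most significant directions: each row is a dense word vector.
word_vectors = U[:, :k] * S[:k]
print(word_vectors.shape)    # (|V|, k)
```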
13. Latent semantic Indexing / Analysis (1994)
Apply SVD to the term-document matrix, X ≈ U Σ Vᵀ:
U: dense word vector representations
V: dense document vector representations
LSA / LSI and HAL methods made huge advancements in document retrieval and semantic similarity.
14. Deep learning Word Embeddings (2003)
“A Neural Probabilistic Language Model” Bengio et al. 2003
Original task “Language Modeling” :
- Prediction of next word given sequence of previous words.
- Useful in speech recognition, autocompletion, machine translation.
“The Cat Chills on a mat ” , Calculate : P( mat | the, cat, chills, on, a )
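For intuition only, here is a toy count-based bigram language model that estimates such next-word probabilities from a tiny corpus; this is not the neural model of the paper, just the language-modeling task made concrete (the corpus sentences are made up).

```python
from collections import Counter, defaultdict

corpus = ["the cat chills on a mat", "the dog sits on a mat"]

bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

def p_next(prev, word):
    """P(word | prev) estimated from bigram counts."""
    total = sum(bigrams[prev].values())
    return bigrams[prev][word] / total if total else 0.0

print(p_next("a", "mat"))   # 1.0 in this toy corpus
```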
15. Deep learning Word Embeddings (2003)
“A Neural Probabilistic Language Model” Bengio et al. 2003
Quoting from the paper:
“This is intrinsically difficult because of the curse of dimensionality: a word
sequence on which the model will be tested is likely to be different from all the
word sequences seen during training.”
“We propose to fight the curse of dimensionality by learning a distributed
representation for words”
16. Continuous Bag of Words model (CBOW)
Tomas Mikolov et al. (2013)
The model predicts the current word given its context.
Scan the text of a large corpus with a sliding window:
“The Cat Chills on a mat” → x0 x1 x2 x3 x4 x5
Input: x0, x1, x3, x4    Output: x2
17. Continuous Bag of Words model (CBOW)(2013)
|V|: vocabulary size
x_i ∈ R^(1 × |V|): one-hot vector representation of each context word
y_i ∈ R^(|V| × 1): one-hot representation of the correct middle word (expected output)
The slide depicts the one-hot context vectors x0, x1, x3, x4 entering a black box whose output should match y_i.
18. Continuous Bag of Words model (CBOW)(2013)
|V|: vocabulary size
x_i ∈ R^(1 × |V|): one-hot vector representation of each context word
y_i ∈ R^(|V| × 1): one-hot representation of the correct middle word (expected output)
Inside the black box: the context vectors x0, x1, x3, x4 are multiplied by W(1), averaged, multiplied by W(2), and passed through a softmax to produce a prediction that is compared with y_i.
19. Continuous Bag of Words model (CBOW)(2013)
n: arbitrary dimensionality of our word embeddings
W(1) ∈ R^(|V| × n): input word matrix, one n-dimensional row per vocabulary word
u_i = x_i W(1) ∈ R^(1 × n): representation of x_i after multiplication with the input matrix
Example from the slide, with |V| = 7 and n = 3, W(1) has rows (0,1,3), (1,3,6), (5,0,3), (9,8,0), (2,2,2), (5,6,7), (8,8,8).
The one-hot context vectors select the corresponding rows: u0 = (2,2,2), u1 = (9,8,0), u3 = (8,8,8), u4 = (5,0,3).
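Multiplying a one-hot vector by W(1) is simply a row lookup; a small numpy sketch with the example matrix above illustrates this (index choices are mine, for illustration):

```python
import numpy as np

# Example W(1): one 3-dimensional row per vocabulary word (|V| = 7, n = 3).
W1 = np.array([[0, 1, 3],
               [1, 3, 6],
               [5, 0, 3],
               [9, 8, 0],
               [2, 2, 2],
               [5, 6, 7],
               [8, 8, 8]])

x0 = np.zeros(7)
x0[4] = 1            # one-hot vector for the word at index 4

u0 = x0 @ W1         # matrix product ...
print(u0)            # [2. 2. 2.]
print(W1[4])         # ... is identical to selecting row 4 directly
```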
20. Continuous Bag of Words model (CBOW)(2013)
h_i ∈ R^(1 × n), h_i = average of u0, u1, u3, u4
With the example vectors: h_i = ((2,2,2) + (9,8,0) + (8,8,8) + (5,0,3)) / 4 = (6, 4.5, 3.25)
21. Continuous Bag of Words model (CBOW)(2013)
W(2) ∈ R^(n × |V|): output word matrix
Z = h_i W(2) ∈ R^(1 × |V|): output score vector for the middle word
The slide shows an example 3 × 7 output matrix W(2) and the resulting 7-dimensional score vector Z.
22. Continuous Bag of Words model (CBOW)(2013)
How do we compare Z to y_i?
Does the largest value correspond to the correct class? Not directly ... use Softmax.
Softmax squashes a K-dimensional vector of arbitrary real values into a K-dimensional vector of values in the range (0, 1) that sum to 1:
softmax(Z)_j = exp(Z_j) / Σ_k exp(Z_k)
The slide compares the one-hot target y_i = (1, 0, 0, ...) with an example raw score vector Z.
23. Continuous Bag of Words model (CBOW)(2013)
ŷ = softmax(Z)
y_i ∈ R^(|V| × 1): one-hot representation of the correct middle word
Example from the slide: y_i = (1, 0, 0, 0, 0, 0, 0) and ŷ = (0.7, 0.1, 0.02, 0.08, ...).
24. Continuous Bag of Words model (CBOW)(2013)
● We need the estimated distribution ŷ to be as close as possible to the true answer y
● A common error function is the cross entropy H(ŷ, y) (why?), since y is a one-hot vector
25. Continuous Bag of Words model (CBOW)(2013)
● We need the estimated distribution ŷ to be as close as possible to the true answer y
● A common error function is the cross entropy error H(ŷ, y) (why?)
● Since y is a one-hot vector, the cross entropy reduces to H(ŷ, y) = −log ŷ_c, where c is the index of the correct middle word
26. Continuous Bag of Words model (CBOW)(2013)
A perfect language model would assign probability ŷ_c = 1 to the correct word, so the loss would be 0.
Optimization task:
● Learn W(1) and W(2) to minimize the cost function over the whole dataset.
● Using backpropagation, update the weights in W(1) and W(2).
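Putting the pieces together, here is a minimal numpy sketch of one CBOW forward pass and its cross-entropy loss; the shapes follow the slides, the weights and indices are random placeholders, and the backpropagation update is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
V, n = 7, 3                      # vocabulary size and embedding dimensionality

W1 = rng.normal(size=(V, n))     # input word matrix: one row per word
W2 = rng.normal(size=(n, V))     # output word matrix

context_ids = [4, 3, 6, 2]       # indices of the context words x0, x1, x3, x4
target_id = 1                    # index of the middle word x2

# Forward pass: look up context embeddings, average, score, softmax.
u = W1[context_ids]              # (4, n) context word vectors
h = u.mean(axis=0)               # (n,)  averaged context representation
z = h @ W2                       # (V,)  raw scores over the vocabulary
y_hat = np.exp(z - z.max())
y_hat /= y_hat.sum()             # softmax probabilities

# Cross entropy with a one-hot target reduces to -log of the correct word's probability.
loss = -np.log(y_hat[target_id])
print(y_hat, loss)
```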
27. Continuous Bag of Words model (CBOW)(2013)
W(1) after training over a large corpus (the |V| × n example matrix shown earlier):
● Each row represents a dense vector for one word in the vocabulary
● These word vectors contain better semantic and syntactic information than other dense vectors (shown later)
● These word vectors perform better on many NLP tasks (shown later)
29. GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
Motivation:
ice - steam = ( solid, gas, water, fashion ) ?
● A distributional model should capture words that appear with “ice” but not with “steam”.
● Hence it should do well on the semantic analogy task (explained later).
30. GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
Starts from a co-occurrence matrix X:
P(solid | ice) = X_solid,ice / X_ice
31. GloVe: Global Vectors for Word Representation, Pennington et al. (2014)
Optimize the objective function (shown on the slide), where:
w_i is the word vector of word i
P_ik is the probability that word k occurs in the context of word i
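The objective itself appears as an image on the slide; for reference, the weighted least-squares objective from the GloVe paper, in LaTeX notation, is

J = \sum_{i,j=1}^{|V|} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the number of times word j occurs in the context of word i, w_i and \tilde{w}_j are the word and context word vectors, b_i and \tilde{b}_j are biases, and f is a weighting function that down-weights very frequent co-occurrences.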
32. Ok, But are word vectors really good ?!
Evaluation of word vectors:
1. Intrinsic evaluation: make sure they encode semantic information
2. Extrinsic evaluation: make sure they are useful for other NLP tasks (the hype)
34. Intrinsic Evaluation of Word Vectors
Results from : GloVe: Global Vectors for Word Representation, Pennington et al 2014.
Word similarity dataset “WS353”: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/
Word similarity task
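Word similarity benchmarks such as WS353 typically score pairs by the cosine similarity of their word vectors; a minimal sketch with made-up toy vectors (not the WS353 data) looks like this, echoing the boy/girl/car example from the drawbacks slide.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors purely for illustration.
vectors = {
    "boy":  np.array([0.9, 0.1, 0.3]),
    "girl": np.array([0.8, 0.2, 0.35]),
    "car":  np.array([0.1, 0.9, 0.0]),
}

print(cosine(vectors["boy"], vectors["girl"]))   # high: similar words
print(cosine(vectors["boy"], vectors["car"]))    # lower: dissimilar words
```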
38. Extrinsic Evaluation of Word Vectors
Part of Speech Tagging :
input : Word Embeddings are cool
output: Noun Noun Verb Adjective
Named Entity recognition :
input : Nous sommes charlie hebdo
output: Out Out Person Person
39. Extrinsic Evaluation of Word Vectors
* Systems: POS: (Toutanova et al. 2003), NER: (Ando & Zhang 2005)
** 130,000-word embedding trained on Wikipedia and Reuters with an 11-word window and a 100-unit hidden layer (for 7 weeks!), then supervised task training
*** Features are character suffixes for POS and a gazetteer for NER
40. “Unsupervised Pretraining”
(the secret sauce)
Problem:
1. Task T1: few training data (D1)
2. Hand-crafted feature representation of the inputs: R1
3. A machine learning algorithm M1 on T1 using R1 performs badly
Solution:
1. Create a task T2 with lots of available training data (D2), unsupervised, but with the same input as T1
2. Solve T2 using D2 and learn a representation of the inputs (R2)
3. R2 + M1 performs better than R1 + M1 on task T1
42. Even better results !!
* Same architecture as C&W 2011, but word embeddings are kept constant during the supervised training phase
** C&W is the unsupervised pre-training + supervised NN + features model of the last slide
44. Other word embeddings :
● Dependency-Based Word Embeddings: Levy et al. 2014: http://www.aclweb.org.....
● Sentiment Analysis Word Embeddings: http://ai.stanford.edu/~ang/pap.....
Knowledge base embeddings:
● Structured Embeddings (SE) (Bordes et al. '11)
● Collective Matrix Factorization (RESCAL) (Nickel et al. '11)
● Neural Tensor Networks (Socher et al. '13)
● TATEC (Garcia-Duran et al. '14)
Other Types of Embeddings:
45. Joint embeddings (Text + Knowledge bases):
● Joint Learning of Words and Meaning Representations (Bordes et al. ‘12)
● Knowledge Graph and Text Jointly Embedding (Wang et al ‘14)
Other Types of Embeddings:
46. References:
Before Word2Vec:
Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams. "Learning representations by back-propagating errors." Cognitive modeling 5 (1988): 3. http://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf
Bengio, Yoshua, et al. "A neural probabilistic language model." The Journal of Machine Learning Research 3 (2003): 1137-1155. http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf
47. References:
Word2vec (CBOW and Skip Gram):
Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. "Distributed representations of words and phrases and their compositionality." In Proceedings of NIPS, 2013.
Mikolov, Tomas, Wen-tau Yih, and Geoffrey Zweig. "Linguistic regularities in continuous space word representations." In Proceedings of NAACL HLT, 2013.
Pennington et al. "GloVe: Global Vectors for Word Representation." (2014). http://www-nlp.stanford.edu/pubs/glove.pdf
48. Further Readings:
Negative sampling: http://papers.nips.cc/paper/....
Energy based learning : http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf
Joint learning (learning tasks simultaneously): http://ronan.collobert.com/pub...
49. Learning Resources
Deep Learning for NLP ( Stanford Course )
http://cs224d.stanford.edu/
Deep Learning for Natural Language Processing (Without Magic): NAACL 2013 Tutorial
http://nlp.stanford.edu/courses/NAACL2013/