This paper improves the skip-gram model for learning word and phrase embeddings. It proposes treating frequent word pairs as single phrase tokens, which greatly reduces the number of training tokens. It also introduces subsampling of frequent words to speed up training and improve the representations of infrequent words. Further, it compares methods for reducing computational complexity, such as hierarchical softmax, noise contrastive estimation, and negative sampling: negative sampling works best in the word-level experiments, while the best phrase model combines hierarchical softmax with subsampling and is trained on billions of words. The paper also demonstrates additive compositionality of word vectors.
1. A Summary of Distributed Representations of
Words and Phrases and Their Compositionality
Xiangnan YUE
xiangnanyue@gmail.com
Ingénieur Civil @Mines ParisTech
M2 Data Science @Ecole Polytechnique
2. Road Map
- Part 1 - Learning Word2Vec
- Baseline Skip-gram model
- Hierarchical softmax
- Negative sampling
- Sub-sampling trick
- Part 2 - Learning Phrases
- Part 3 - Empirical Results
- Part 4 - Additive Compositionality
- Part 5 - Conclusion
3. Background
This paper builds on [1], Efficient Estimation of Word Representations in Vector Space (2013).
It proposes several important improvements to the Skip-gram model, which is widely used for obtaining the well-known "word vectors". [1] introduced CBOW (the Continuous Bag-of-Words model) and Skip-gram, both using a 3-layer neural network to train the word vectors.
While CBOW takes the context as input and predicts the center word, Skip-gram takes the center word as input and predicts the context.
5. The principle of the baseline Skip-gram
- Given a center word as input, the model outputs the conditional probabilities of the surrounding words. The goal is to obtain the hidden-layer "word vectors" by maximizing the likelihood function.
- The objective is to maximize a log-likelihood function (restated below the figure reference).
- There is no activation function on the hidden layer, and the output layer uses the softmax function.
- The hidden-layer parameters are trained as the "word vectors": a matrix of shape (m, n), where m is the total number of words and n is the number of features (the vector length).
Figure 1. Architecture of a basic skip-gram; graph adapted from http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
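For completeness, the log-likelihood objective and the full softmax used by the baseline Skip-gram, as defined in the paper; here $v_w$ and $v'_w$ are the input and output vector representations of word $w$, $c$ is the context size, and $m$ the vocabulary size:
\[
\frac{1}{T}\sum_{t=1}^{T} \sum_{\substack{-c \le j \le c \\ j \neq 0}} \log p(w_{t+j} \mid w_t),
\qquad
p(w_O \mid w_I) = \frac{\exp\!\left({v'_{w_O}}^{\top} v_{w_I}\right)}{\sum_{w=1}^{m} \exp\!\left({v'_{w}}^{\top} v_{w_I}\right)}.
\]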
6. Remarks on the Baseline Skip-gram
- The objective rests on a naive Bayes assumption: the context words w(t+j) and w(t+i) are conditionally independent given w(t), so the joint probability factorizes and the log splits into a sum of log p(w(t+j) | w(t)) terms.
- The output-layer parameters are expensive to train. For any input word vector, the softmax requires an inner product with every output vector (between the hidden layer and the softmax) to obtain the probability of each word w_O. The output-layer parameter matrix has the same size as the word-vector matrix, (m, n), which is impractical when the vocabulary size m is very large.
7. Remarks on the Baseline Skip-gram (continued)
- Moreover, training such a large number of parameters requires a large training corpus, and the running time grows roughly with the product of the two, so training becomes very slow.
- Therefore, extensions of Skip-gram were proposed, such as Hierarchical Softmax and Negative Sampling (one contribution of this paper).
- Stochastic Gradient Descent is used to train the Skip-gram model.
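To make the cost concrete, here is a small illustrative NumPy sketch (not the paper's code; the sizes and names are made up) of the full softmax: every training example needs an inner product with all m output vectors.

```python
import numpy as np

# Illustrative sketch (not the paper's code): why the full softmax is costly.
# W_in  : (m, n) input word vectors (the "word vectors" being learned)
# W_out : (m, n) output vectors sitting between the hidden layer and the softmax
m, n = 30_000, 300                       # vocabulary size, feature dimension (made up)
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(m, n))
W_out = rng.normal(scale=0.1, size=(m, n))

def softmax_probs(center_id: int) -> np.ndarray:
    """Full softmax over the vocabulary: one inner product per output word."""
    v = W_in[center_id]                  # hidden layer = the input word's vector (no activation)
    scores = W_out @ v                   # m inner products -> O(m * n) work per training sample
    scores -= scores.max()               # numerical stability
    p = np.exp(scores)
    return p / p.sum()

print(softmax_probs(42).shape)           # (30000,): every word is scored for a single sample
```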
9. The Hierarchical Softmax
To reduce the complexity of computing the output-layer weights, Hierarchical Softmax (HS) is used in the Skip-gram model. HS was first proposed by [2] Morin and Bengio (2005).
We first introduce the data structure used in HS: the Huffman tree / Huffman coding. A Huffman tree assigns short codes (paths) to the most frequent words, so the descent from the root reaches these words quickly.
10. Huffman Tree
The Huffman tree was originally proposed for lossless data compression. It is constructed bottom-up from the frequency table of the words in the corpus.
Words are placed at the leaves (a prefix coding), so the most frequent words lie close to the root and the rarest words lie near the bottom.
The decoding path starts from the root and stops when word w is found (i.e. when its leaf is reached).
Figure 2. Huffman Tree: an example. Graph adapted from https://en.wikipedia.org/wiki/Huffman_coding
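A minimal sketch of the bottom-up construction, assuming only a word-frequency table (the counts below are invented for illustration); frequent words receive short codes, i.e. short paths from the root:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a binary Huffman tree bottom-up and return {word: binary code}.

    Frequent words end up with short codes, i.e. short paths from the root."""
    tiebreak = count()                                  # avoids comparing dicts on equal frequencies
    heap = [(f, next(tiebreak), {w: ""}) for w, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)               # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in left.items()}          # going left adds a 0 ...
        merged.update({w: "1" + c for w, c in right.items()})   # ... going right adds a 1
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Invented frequency table, for illustration only
print(huffman_codes({"the": 50, "of": 30, "word": 10, "huffman": 5, "rare": 2}))
# e.g. {'the': '1', 'of': '01', 'word': '001', 'huffman': '0001', 'rare': '0000'}
```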
11. Huffman Tree
Here |w| denotes the total number of words (equal to the number of leaves of the Huffman tree), |f| the number of features on the hidden layer, and \sigma(\cdot) the sigmoid function.
The Huffman tree reduces the computational complexity of the output layer from O(|w| \cdot |f|) to approximately O(\log|w| \cdot |f|) per input sample. The softmax function is replaced by
\[
p(w \mid w_I) = \prod_{j=1}^{L(w)-1} \sigma\!\left( [\![\, n(w, j{+}1) = \mathrm{ch}(n(w, j)) \,]\!] \cdot {v'_{n(w,j)}}^{\top} v_{w_I} \right),
\]
where L(w) is the length of the path from the root to w, n(w, j) is the j-th node on that path, and [\![ x ]\!] is 1 if x is true and -1 otherwise.
This equation describes the process: ch(\cdot) is fixed to be the left child node. Then, for the chosen w, if the path goes left at node j we take \sigma({v'}^{\top} v), otherwise (1 - \sigma({v'}^{\top} v)) = \sigma(-{v'}^{\top} v). The path from the root to w defines a maximum-likelihood problem, and the parameters are updated by backpropagation.
12. Huffman Tree (continued)
The number of parameters stays the same, yet a binary Huffman tree reduces the computation to an acceptable level.
Alternative methods for constructing the tree structure can be found in [3] Mnih and Hinton (2009).
HS had already been used with Skip-gram in [1], where its performance was not fully satisfying. This paper compares Hierarchical Softmax with Negative Sampling and with Noise Contrastive Estimation (NCE, proposed by [4] Gutmann and Hyvärinen (2012)), together with the subsampling trick and phrases as training samples.
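An illustrative sketch of how the probability of one word is computed along its Huffman path: only about log|w| inner products and sigmoids instead of |w| for the full softmax. The node indexing and vectors below are hypothetical, not taken from the word2vec code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hs_probability(v_in, path_nodes, code, node_vectors):
    """Hierarchical-softmax probability of one word given the input word's vector.

    path_nodes   : indices of the inner nodes on the root-to-leaf path of the word
    code         : the word's Huffman code ('0' = go left, '1' = go right)
    node_vectors : (num_inner_nodes, n) output vectors, one per inner node
    """
    p = 1.0
    for node, bit in zip(path_nodes, code):
        s = sigmoid(node_vectors[node] @ v_in)
        p *= s if bit == "0" else (1.0 - s)      # left branch: sigma, right branch: 1 - sigma
    return p                                     # only ~log|w| inner products, not |w|

# Tiny invented example: 7 inner nodes, 4 features, a word at depth 3
rng = np.random.default_rng(1)
node_vectors = rng.normal(size=(7, 4))
v_in = rng.normal(size=4)
print(hs_probability(v_in, path_nodes=[0, 2, 5], code="010", node_vectors=node_vectors))
```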
14. Negative Sampling Method
The principle is that each training sample adjusts only a small part of the parameters. The idea is to select a few negative samples (words that do not appear around the input center word) and to update only the output-layer weights of the selected negative words and of the positive word.
Remarks:
- This is loosely similar to a dropout layer in a deep neural network, except that dropout acts between hidden layers rather than removing output units.
- In the hidden layer, the number of updated weights is unchanged: only the feature weights of the input word are modified.
15. Negative Sampling Method
How to select negative samples: use the unigram distribution P_n(w) to draw the negative words. According to the paper, in the objective function, \log p(w_O \mid w_I) is replaced by the following expression
\[
\log \sigma\!\left({v'_{w_O}}^{\top} v_{w_I}\right) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_{w_I}\right) \right],
\]
where the noise distribution is the unigram distribution raised to the 3/4 power, P_n(w) \propto U(w)^{3/4}; this value is purely empirical.
Note that \sigma is the sigmoid function, \sigma(x) = 1 / (1 + e^{-x}); thus the first term above is the log-probability of the positive word, and the second term is the log-probability of not observing the negative words.
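A small illustrative sketch (names and sizes are made up) of the two ingredients described above: drawing negatives from the unigram distribution raised to the 3/4 power, and the per-example quantity that is maximized:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_negatives(unigram_counts, k):
    """Draw k negative words from the unigram distribution raised to the 3/4 power."""
    p = unigram_counts ** 0.75
    p /= p.sum()
    return rng.choice(len(unigram_counts), size=k, p=p)

def negative_sampling_objective(v_in, v_out_pos, v_out_negs):
    """Per-example quantity to maximize: log sigma(pos) + sum of log sigma(-neg)."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    positive = np.log(sigmoid(v_out_pos @ v_in))              # observed context word
    negatives = np.log(sigmoid(-(v_out_negs @ v_in))).sum()   # k sampled negative words
    return positive + negatives

counts = np.array([50.0, 30.0, 10.0, 5.0, 2.0])               # invented unigram counts
print(sample_negatives(counts, k=5))

n = 4
v_in, v_pos = rng.normal(size=n), rng.normal(size=n)
v_negs = rng.normal(size=(5, n))
print(negative_sampling_objective(v_in, v_pos, v_negs))
```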
17. Subsampling Heuristic
Subsampling is important because some words are much more frequent than others. When the sampling window is large enough, such words appear in many windows yet carry little information (just as a variable that takes a value with probability 1 carries no information, in terms of entropy).
Heuristically, reducing how often the same frequent word is sampled, while preserving the frequency ranking over the whole corpus, even improves the final results. The method used in this paper keeps each occurrence with a probability of the order of \sqrt{t / f(w)}.
The discard probability is P(w_i) = 1 - \sqrt{t / f(w_i)}, where t is called the subsampling rate and f(w) is the frequency of word w in the corpus.
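A minimal sketch of the discard rule, assuming a simple token list as the corpus (the toy corpus and the value of t below are illustrative only):

```python
import random
from collections import Counter

def subsample(tokens, t=1e-3, seed=0):
    """Discard each occurrence of word w with probability P(w) = 1 - sqrt(t / f(w)).

    t is the subsampling rate; f(w) is the relative frequency of w in the corpus."""
    rnd = random.Random(seed)
    counts = Counter(tokens)
    total = len(tokens)
    kept = []
    for w in tokens:
        f = counts[w] / total
        discard_p = max(0.0, 1.0 - (t / f) ** 0.5)   # the more frequent w is, the more it is discarded
        if rnd.random() >= discard_p:
            kept.append(w)
    return kept

corpus = ["the"] * 1000 + ["skip", "gram", "model"] * 10   # invented toy corpus
print(Counter(subsample(corpus)))                          # "the" is dropped far more often
```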
19. Using Phrases
Another important contribution is to use phrases (word pairs) rather than single words as training tokens. Although this can be regarded as a preprocessing step, it is of great importance, given that:
- The number of training tokens is greatly reduced.
- The meaning of composed phrases, which is often not deducible from the individual words, is preserved.
The co-occurrence score is computed simply from bigram and unigram counts, as sketched below; when the score is large enough, the two words are merged into a single phrase token.
The whole training set is passed through this preprocessing 3-4 times. In the paper, the quality of the obtained phrase vectors is tested with another analogical reasoning task (72% accuracy).
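The paper's scoring rule is score(w_i, w_j) = (count(w_i w_j) - \delta) / (count(w_i) \cdot count(w_j)), where \delta is a discounting coefficient that prevents very rare word pairs from being merged. A single-pass sketch (the threshold, \delta, and the toy sentence are illustrative choices, not the paper's settings):

```python
from collections import Counter

def detect_phrases(tokens, delta=1, threshold=0.1):
    """One pass of phrase detection: merge a bigram into a single token when its score
    (count(wi wj) - delta) / (count(wi) * count(wj)) exceeds the threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    out, i = [], 0
    while i < len(tokens) - 1:
        a, b = tokens[i], tokens[i + 1]
        score = (bigrams[(a, b)] - delta) / (unigrams[a] * unigrams[b])
        if score > threshold:
            out.append(a + "_" + b)        # merged phrase token
            i += 2
        else:
            out.append(a)
            i += 1
    if i == len(tokens) - 1:               # last token was not part of a merge
        out.append(tokens[-1])
    return out

text = "we visited new york and new york city last year in new york".split()
print(detect_phrases(text))                # "new york" becomes the single token "new_york"
```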
21. Analogical Reasoning Task
An example of analogical reasoning is shown in Figure 3. Given three phrases, the question is to what degree the fourth phrase can be inferred from the first three phrase vectors. Using cosine distance between vectors, the closest candidate is selected and compared with the true answer.
This test method is also used in [1].
Figure 3. Examples of the analogical reasoning test: in each cell, given the first three words, the accuracy of predicting the fourth is tested.
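A sketch of how such a test can be scored with cosine similarity; the tiny random "embedding" below exists only to make the snippet runnable, real vectors come from Skip-gram training:

```python
import numpy as np

def answer_analogy(a, b, c, vectors):
    """Solve 'a : b :: c : ?' by vector arithmetic and cosine similarity.

    vectors: dict mapping word -> unit-normalized numpy vector."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if w in (a, b, c):                 # never return one of the query words
            continue
        sim = float(target @ v)            # cosine similarity, since vectors are unit norm
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# Random vectors only to make the snippet runnable; real vectors come from training.
rng = np.random.default_rng(3)
vectors = {}
for w in ["paris", "france", "berlin", "germany", "river"]:
    v = rng.normal(size=8)
    vectors[w] = v / np.linalg.norm(v)
print(answer_analogy("france", "paris", "germany", vectors))   # "berlin" with trained vectors
```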
22. Empirical Results (words, no phrase learning)
On the analogical reasoning task, the paper reports:
- Negative Sampling outperforms Hierarchical Softmax and is slightly better than Noise Contrastive Estimation.
- Subsampling speeds up training and even improves the Semantic and Total accuracy.
Setup: dimensionality = 300; context size = 5; words with frequency < 5 discarded; k = 5 or 15 negative samples; corpus size on the order of billions of words.
23. Empirical Results (using phrases)
On the analogical reasoning task, the paper reports:
- With subsampling, Hierarchical Softmax outperforms the other methods, yet without subsampling HS has the worst accuracy.
- Because using phrases greatly reduces the total number of training tokens, more training data is needed (continued on the next slide).
Setup: dimensionality = 300; context size = 5; words with frequency < 5 discarded; k = 5 or 15 negative samples.
24. Empirical Results (using phrases)
The paper reports that training on 33 billion words with dimensionality = 1000 and the entire sentence as the context results in an accuracy of 72%. The best-performing model is still HS + subsampling.
26. Additive Compositionality
Additive compositionality is an interesting property of word vectors. It shows that a non-trivial degree of language understanding can be obtained with basic operations (here, the sum) on the word representations, which goes beyond analogical reasoning.
The authors give an intuitive explanation in terms of a product of distributions, restated below.
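Restating that intuition: the Skip-gram vectors are trained to predict context words log-linearly, so adding two word vectors corresponds (up to normalization) to multiplying their context distributions, which behaves like an AND over contexts. For example, the paper notes that vec("Russia") + vec("river") is close to vec("Volga River").
\[
p(c \mid w) \propto \exp\!\left({v'_{c}}^{\top} v_{w}\right)
\;\;\Longrightarrow\;\;
\exp\!\left({v'_{c}}^{\top}\,(v_{w_1} + v_{w_2})\right) \;\propto\; p(c \mid w_1)\; p(c \mid w_2).
\]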
28. Conclusion
The paper's main contributions are:
- Using common word pairs (phrase representations) to replace single words in the training data.
- Subsampling frequent words, which greatly decreases training time and improves performance for infrequent words.
- Proposing Negative Sampling.
- Comparing combinations of the different methods and training the best model up to then on a huge (33 billion word) data set.
- Addressing the additive compositionality of word vectors.
The techniques of this paper can also be applied to CBOW from [1], and the code is available at https://code.google.com/archive/p/word2vec/
29. References
[1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. ICLR Workshop, 2013.
[2] Frederic Morin and Yoshua Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the International Workshop on Artificial Intelligence and Statistics, pages 246-252, 2005.
[3] Andriy Mnih and Geoffrey E. Hinton. A scalable hierarchical distributed language model. Advances in Neural Information Processing Systems, 21:1081-1088, 2009.
[4] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The Journal of Machine Learning Research, 13:307-361, 2012.