Context-based movie search: finding the title of a movie a user cannot remember, from the context of the user's question, using the doc2vec and word2vec algorithms.
Text Mining Term Project

Introduction
People sometimes have a craving to find a movie they once glimpsed. They typically ask for the movie's title on Q&A sites and wait for an answer. The answerers often seem like 'gods of movies', so we wanted to imitate their prophecy.

Question Examples
Data Gathering
We chose one expert in this field and gathered his answers.

Gathered Data Information
- Q&A site: http://kin.naver.com
- Expert ID: xedz****
- Question & answer pairs: 39,758
- Period: December 2012 ~ March 2018
- Unique movies: 5,900
2 Types of Text Representation
There are two kinds of text representation: sparse and dense.

Comparison of Text Representations
- Sparse (one-hot encoding)
  - Dimension: as many as there are unique words
  - Information: mostly 0 values, carrying little information
- Dense (word embedding)
  - Dimension: set freely, usually 20~200
  - Information: every element has a value, carrying abundant information

source: https://dreamgonfly.github.io/machine/learning,/natural/language/processing/2017/08/16/word2vec_explained.html
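The sparse/dense contrast above can be sketched in a few lines. This is a toy illustration, not code from the project: the vocabulary and the 3-dimensional embedding matrix are made-up stand-ins.

```python
import numpy as np

# Toy vocabulary to contrast sparse one-hot vectors with dense embeddings.
vocab = ["movie", "title", "actor", "scene"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse representation: dimension == vocabulary size, a single 1."""
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# Dense representation: a small matrix, random here for illustration
# (in practice it would be learned, e.g. by word2vec).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(len(vocab), 3))  # 3-dimensional embeddings

def embed(word):
    return embedding[word_to_idx[word]]

sparse_vec = one_hot("title")   # mostly zeros, one 1
dense_vec = embed("title")      # every element carries a value
```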
Main Idea of Word2Vec
Word2Vec is one of the word embedding methods. Its main idea is "You shall know a word by the company it keeps."

Every word has friends around it
Algorithms of Word2Vec
Word2vec has two model architectures: continuous bag-of-words (CBOW) and skip-gram.

Diagrams of CBOW and Skip-gram
source: https://aws.amazon.com/ko/blogs/korea/amazon-sagemaker-blazingtext-parallelizing-word2vec-on-multiple-cpus-or-gpus/
Algorithms of Doc2Vec
Doc2vec has two model architectures: the distributed memory model (PV-DM) and the distributed bag-of-words model (PV-DBOW).

Diagrams of PV-DM and PV-DBOW
source: Distributed Representations of Sentences and Documents

- PV-DM: the concatenation or average of the paragraph vector with a context of three words is used to predict the fourth word. The paragraph vector represents the information missing from the current context.
- PV-DBOW: ignores the context words in the input, but forces the model to predict words randomly sampled from the paragraph in the output. Similar to the skip-gram model.
Preprocessing
We preprocessed the data for better performance, in two steps: first on the whole raw text, then on the tokenized data.

Raw preprocessing
▪ Remove unnecessary words: URLs, special characters (!, ?, *, @, <, >), emoticons (ㅋㅋ, ㅠㅠ), multiple spaces
▪ Stem words that the dictionary cannot correct: (남주 → 남자주인공), (페북 → 페이스북), (영환 → 영화인데), (여자애 → 여자)
▪ Delete unnecessary phrases in questions, e.g. 좀 옛날 영화인데 ~, 페북에서 봤는데, ~ 장면이 있었는데 기억이안나네요
▪ Delete questions whose length is less than 30 characters

Tokenizing
▪ Tokenize with KoNLPy (Twitter package)
▪ POS-tagging: keep only nouns, verbs, and adjectives
▪ Remove tokens that have only one character
▪ Remove stop-words
▪ Delete questions whose token length is less than 10
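The raw-text cleanup step above can be sketched with plain regular expressions. The patterns here are illustrative approximations; the project's tokenizing step used KoNLPy's Twitter tagger, which requires a Java runtime and is omitted here.

```python
import re

# Illustrative patterns for the raw preprocessing step described above.
URL_RE = re.compile(r"https?://\S+")
SPECIAL_RE = re.compile(r"[!?*@<>]")
EMOTICON_RE = re.compile(r"[ㅋㅠ]+")      # e.g. ㅋㅋ, ㅠㅠ
MULTISPACE_RE = re.compile(r"\s{2,}")

def clean_question(text: str) -> str:
    text = URL_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    text = EMOTICON_RE.sub(" ", text)
    text = MULTISPACE_RE.sub(" ", text)
    return text.strip()

cleaned = clean_question("좀 옛날 영화인데!! ㅋㅋ http://example.com 기억이 안나요")
```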
Select Movies and Split Dataset
There are 5,900 movies in the dataset, but many movies have only a few questions. So we removed movies whose question count fell below a cutoff value (basic cutoff = 3). We then split the dataset 8:2, using a stratified method, to test the model.

The number of questions per movie
- 스파이더위크가의 비밀: 259
- 캐빈 인 더 우즈: 222
- 비밀의 숲 테라비시아: 179
- ... (cutoff) ...
- 무서운 영화 2: 1
- 전우: 1
- 전우치: 1

Split into train and test
- 스파이더위크가의 비밀: 207 train / 52 test
- 케빈 인 더 우즈: 177 train / 45 test
- 비밀의 숲 테라비시아: 143 train / 36 test
- 레모니 스니켓의 위험한 대결: 142 train / 36 test
- 플립: 141 train / 36 test
- …
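The cutoff filter and the stratified 8:2 split can be sketched as follows. This is a minimal stand-alone sketch, not the project's code; `filter_and_split` and the toy data are illustrative names.

```python
from collections import Counter, defaultdict
import random

def filter_and_split(data, cutoff=3, test_ratio=0.2, seed=42):
    """Drop movies with fewer than `cutoff` questions, then split each
    remaining movie's questions 8:2 into train/test (stratified)."""
    counts = Counter(label for _, label in data)
    kept = [(q, l) for q, l in data if counts[l] >= cutoff]

    by_label = defaultdict(list)
    for q, l in kept:
        by_label[l].append((q, l))

    rng = random.Random(seed)
    train, test = [], []
    for label, items in by_label.items():
        rng.shuffle(items)
        n_test = max(1, int(len(items) * test_ratio))  # per-movie 20%
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# 10 questions for one movie, 1 question for a rare movie (dropped by cutoff).
data = [(f"q{i}", "movieA") for i in range(10)] + [("q_rare", "movieB")]
train, test = filter_and_split(data)
```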
Modeling – Word2vec
To train the word2vec model, we inserted the answers (labels) between the tokenized words of each question (one label token every 5 words). Using this corpus, we trained the word2vec model.

Q: question, A: answer (label), W: word

- Number of unique labels in the train/test data: 2,021
- Train set: 22,620 questions; test set: 5,655 questions
- Skip-gram architecture
- Dimensionality of the feature vectors: 300
- Window size: 10
- Hierarchical softmax
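The corpus-construction trick above (inserting the label token every 5 words so word2vec learns it in the question's context) can be sketched as follows; the function name and sample tokens are illustrative.

```python
def interleave_label(tokens, label, every=5):
    """Insert `label` before every `every`-th question token."""
    out = []
    for i, tok in enumerate(tokens):
        if i % every == 0:
            out.append(label)
        out.append(tok)
    return out

tokens = ["island", "man", "wakes", "up", "alone", "survives", "years"]
corpus_line = interleave_label(tokens, "CastAway", every=5)
# the label now appears before token 0 and before token 5
```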
Modeling – Word2vec
Each word in the test set is looked up in the trained model to obtain its word vector. All the vectors belonging to one question are then combined into a single vector (the document vector).
Modeling – Word2vec
We also embed the unique answers (labels) into the model to obtain label vectors. After that, we calculate the pairwise cosine similarity between the label vectors (V_An) and the document vectors (V'_k).

- k: index over the test-set documents (K = number of test-set questions)
- n: index over the unique labels
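The scoring step can be sketched with NumPy: document vectors as a mean of word vectors, scored against every label vector by cosine similarity. All vectors below are random stand-ins for trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(7, 300))   # word vectors of one test question
doc_vec = word_vecs.mean(axis=0)        # combined document vector V'_k

label_vecs = rng.normal(size=(5, 300))  # 5 candidate label vectors V_An

def cosine_matrix(docs, labels):
    """Pairwise cosine similarity, shape (num_docs, num_labels)."""
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    labels = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    return docs @ labels.T

scores = cosine_matrix(doc_vec[None, :], label_vecs)
best_label = int(scores.argmax())       # highest-similarity label index
```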
Modeling – Word2vec
Finally, we normalize the cosine similarity scores for each document vector, binarize the answers (labels), and evaluate the performance of the model on the test set.

Example:
V'_k = [0.05, 0.001, 0.003, ..., 0.002]
A_k = [1, 0, 0, ..., 0]
Modeling – Doc2vec
In the doc2vec model we do not need to insert the correct answer into the text as in the word2vec model, because the answer (label) is also learned, as a document tag.

- PV-DM (distributed memory) architecture
- Dimensionality of the feature vectors: 300
- Window size: 3
- Hierarchical softmax
- Sum of the context word vectors

In PV-DM, the paragraph vectors (label vectors) are trained on a prediction task for the next word in the sentence. Every paragraph is mapped to a unique vector, and the paragraph vector and word vectors are combined to predict the next word in a context.
Modeling – Doc2vec
*For ranking, we compute the cosine similarity between a simple mean of the projection weight vectors of the given documents and the label vectors.
Model evaluation – ROC curve
The ROC curve results for the labels are evaluated with two averaging methods: micro-averaging and macro-averaging.

- Micro-averaging: considers each element of the label indicator matrix as a binary prediction
- Macro-averaging: gives equal weight to the classification of each label

AUC (micro, macro):
- word2vec: 0.78, 0.82
- doc2vec: 0.97, 0.97
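The micro/macro evaluation can be sketched with scikit-learn: y_true is the binarized label indicator matrix and y_score the similarity scores per question. The matrices below are toy data, not the project's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Rows = test questions, columns = candidate labels (toy 3-label example).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
y_score = np.array([[0.8, 0.1, 0.1],
                    [0.2, 0.6, 0.2],
                    [0.1, 0.3, 0.6],
                    [0.5, 0.4, 0.1]])

# Micro: flatten the indicator matrix; macro: average per-label AUCs.
micro = roc_auc_score(y_true, y_score, average="micro")
macro = roc_auc_score(y_true, y_score, average="macro")
```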
Model evaluation – Top-n accuracy approach
Top-n accuracy is evaluated for each label. For example, top-5 accuracy means that one of the model's 5 highest-probability answers must match the expected answer (n: 1~10).

Accuracy* (top-1 to top-10):
- word2vec: 0.08 to 0.20
- doc2vec: 0.49 to 0.73
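Top-n accuracy over a score matrix can be sketched in a few lines: a prediction counts as correct if the true label is among the n highest-scoring labels. The score matrix below is toy data.

```python
import numpy as np

def top_n_accuracy(scores, true_idx, n):
    """Fraction of rows whose true label is among the n best-scoring labels."""
    top_n = np.argsort(-scores, axis=1)[:, :n]
    hits = [t in row for t, row in zip(true_idx, top_n)]
    return float(np.mean(hits))

scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.4, 0.5, 0.1]])
true_idx = [0, 1, 1]   # correct label index per question

top1 = top_n_accuracy(scores, true_idx, 1)  # row 1 misses at n=1
top2 = top_n_accuracy(scores, true_idx, 2)  # all rows hit at n=2
```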
Discussion
• Conclusion
✓ Overall, doc2vec shows better performance than the word2vec model
✓ A service could be built by presenting a list of n (at least 5) candidate answers for each new question
✓ Applicable to a speech-recognition-based movie recommendation service
• Further study
✓ Questions about untrained movies remain a problem
- could be complemented by learning the synopses of the movies
✓ A method for dealing with imbalanced movie data is needed
APPENDIX
We drew graphs to find which movies and genres are asked about most. We found that people wanted to find mysterious and thrilling movies.

Asked Movie Ranking
- 스파이더위크가의 비밀: 259
- 캐빈 인 더 우즈: 222
- 비밀의 숲 테라비시아: 179
- 레모니 스니켓의 위험한…: 178
- 플립: 177
- 트루먼 쇼: 166
- 다이버전트: 151
- 스플라이스: 147
- 아바타: 143
- 업사이드 다운: 131

Asked Movie Genre
- 공포, 스릴러: 7,447
- SF, 판타지: 5,001
- 로맨스, 멜로: 4,585
- 액션, 무협: 2,560
- 코미디: 784
- 드라마: 706
- 애니메이션: 501
APPENDIX
Deleting unnecessary phrases in questions
At the beginning and at the end of each question, we remove all phrases before/after a word from the check-words list, provided that word lies within 20% of the question's length from that end.