- LDAvis is an interactive visualization tool built using R and D3 to help users interpret topics estimated using Latent Dirichlet Allocation (LDA).
- It aims to answer questions about the meaning of each topic, the prevalence of each topic, and how topics relate to each other.
- The tool visualizes term relevance, topic prevalence, and inter-topic distances to help users understand the topics in a corpus.
Bytewise Approximate Match: Theory, Algorithms and Applications - Liwei Ren (任力偉)
Byte-wise approximate matching has become an important field in computer science, with both practical value and theoretical significance. This talk uses six cases to define and describe the concept of approximate matching rigorously: identicalness, containment, cross-sharing, similarity, approximate containment, and approximate cross-sharing. Based on this concept, one can propose a theoretical framework that consists of many problems of approximate matching, searching, and clustering. Algorithmic solutions to the matching problems and their challenges will be outlined, along with theoretical analysis. This framework also includes elements of our previous work on the document fingerprinting problem and on the mathematical evaluation of similarity digest schemes (TLSH, ssdeep, sdhash). In the end, we discuss applications in various security disciplines.
Topic Modeling for Information Retrieval and Word Sense Disambiguation Tasks - Leonardo Di Donato
Experimental work on the use of topic modeling to implement and improve some common information retrieval and word sense disambiguation tasks.
It first describes the scenario, the pre-processing pipeline built, and the framework used, then discusses the investigation of several different hyperparameter configurations for the LDA algorithm.
The work continues with the retrieval of relevant documents, mainly through two approaches: inferring the topic distribution of a held-out document (or query) and comparing it against the collection to retrieve similar documents, or an approach driven by probabilistic querying. The last part of the work is devoted to the word sense disambiguation task.
This is an introduction to Topic Modeling, covering tf-idf, LSA, pLSA, LDA, EM, and some other related material. There are certainly some mistakes; corrections are welcome. Thank you~
Evaluating the Impact of Word Embeddings on Similarity Scoring in Practical I... - Lukas Galke
Abstract: We assess the suitability of word embeddings for practical information retrieval scenarios. Thus, we assume that users issue ad-hoc short queries where we return the first twenty retrieved documents after applying a boolean matching operation between the query and the documents. We compare the performance of several techniques that leverage word embeddings in the retrieval models to compute the similarity between the query and the documents, namely word centroid similarity, paragraph vectors, Word Mover’s distance, as well as our novel inverse document frequency (IDF) re-weighted word centroid similarity. We evaluate the performance using the ranking metrics mean average precision, mean reciprocal rank, and normalized discounted cumulative gain. Additionally, we inspect the retrieval models’ sensitivity to document length by using either only the title or the
full-text of the documents for the retrieval task. We conclude that word centroid similarity is the best competitor to state-of-the-art retrieval models. It can be further improved by re-weighting the word frequencies with IDF before aggregating the respective word vectors of the embedding. The proposed cosine similarity of IDF re-weighted word vectors is competitive with the TF-IDF baseline and even outperforms it in the news domain, by a relative margin of 15%.
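The IDF re-weighted word centroid similarity described above can be sketched as follows; this is a toy illustration with made-up embeddings and IDF values, not the authors' implementation:

```python
import numpy as np

def idf_weighted_centroid(tokens, embeddings, idf):
    """Average the word vectors of a document, weighting each by its IDF."""
    vecs = [idf[t] * embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional embeddings and IDF weights (made-up values).
emb = {"cat": np.array([1.0, 0.0, 0.0]),
       "dog": np.array([0.9, 0.1, 0.0]),
       "stock": np.array([0.0, 1.0, 0.0])}
idf = {"cat": 2.0, "dog": 2.0, "stock": 3.0}

query = idf_weighted_centroid(["cat"], emb, idf)
score_pets = cosine(query, idf_weighted_centroid(["cat", "dog"], emb, idf))
score_finance = cosine(query, idf_weighted_centroid(["stock"], emb, idf))
print(score_pets > score_finance)  # the pet document ranks higher
```

The IDF weighting down-weights common (low-information) words before the vectors are averaged, which is the whole difference from plain word centroid similarity.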
Meta-learning, or learning how to learn, is our innate ability to learn new, ever more complex tasks very efficiently by building on prior experience. It is a very exciting direction for machine learning (and AI in general). In this tutorial, I introduce the main concepts and state of the art.
Semantics2018 Zhang, Petrak, Maynard: Adapted TextRank for Term Extraction: A G... - Johann Petrak
Slides for the talk about the paper:
Ziqi Zhang, Johann Petrak and Diana Maynard, 2018: Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. Semantics-2018, Vienna, Austria
Supporting Springer Nature Editors by means of Semantic Technologies - Francesco Osborne
The Open University and Springer Nature have been collaborating since 2015 in the development of an array of semantically-enhanced solutions supporting editors in i) classifying proceedings and other editorial products with respect to the relevant research areas and ii) making informed decisions about their marketing strategy. These solutions include i) the Smart Topic API, which automatically maps keywords associated with published papers to semantically characterized topics, drawn from a very large and automatically-generated ontology of Computer Science topics; ii) the Smart Topic Miner, which helps editors associate scholarly metadata with books; and iii) the Smart Book Recommender, which assists editors in deciding which editorial products should be marketed at a specific venue.
In this natural language understanding (NLU) project, we implemented and compared various approaches for predicting the topics of paragraph-length texts. This paper explains our methodology and results for the following approaches: Naive Bayes, One-vs-Rest Support Vector Machine (OvR SVM) with GloVe vectors, Latent Dirichlet Allocation (LDA) with OvR SVM, Convolutional Neural Networks (CNN), and Long Short Term Memory networks (LSTM).
Multimodal Searching and Semantic Spaces: ...or how to find images of Dalmati... - Jonathon Hare
Presented at the "Reality of the Semantic Gap in Image Retrieval" tutorial at the first international conference on Semantics And digital Media Technology (SAMT 2006), 6 December 2006.
"Towards a Science of Reproducible Science?" DPRMA Workshop talk at JCDL 2013, Indianapolis, 25th July 2013. Workshop website is http://dprma.oerc.ox.ac.uk/
The paper is:
David De Roure. 2013. Towards computational research objects. In Proceedings of the 1st International Workshop on Digital Preservation of Research Methods and Artefacts (DPRMA '13). ACM, New York, NY, USA, 16-19. DOI=10.1145/2499583.2499590 http://doi.acm.org/10.1145/2499583.2499590
Talk by Sergey Koltsov (HSE University) at the International Conference on Big Data and its Applications (ICBDA).
ICBDA is a conference for entrepreneurs and developers on how to solve business problems effectively with big data analysis.
http://icbda2015.org/
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLING - cscpconf
A large amount of digital text is generated every day. Effectively searching, managing and exploring this text data has become a major task. In this paper, we first present an introduction to text mining and to the probabilistic topic model Latent Dirichlet Allocation. Then two experiments are proposed: topic modelling of Wikipedia articles and of users' tweets. The former builds a document topic model, aiming at a topic-based approach to searching, exploring and recommending articles. The latter builds a user topic model, providing an analysis of Twitter users' interests. The experimental process, including data collection, data pre-processing and model training, is fully documented and commented. Furthermore, the conclusions and applications of this paper could serve as a useful computational tool for social and business research.
A Scalable Gibbs Sampler for Probabilistic Entity Linking - Sunny Kr
Entity linking involves labeling phrases in text with their referent entities, such as Wikipedia or Freebase entries. This task is challenging due to the large number of possible entities (in the millions) and heavy-tailed mention ambiguity. We formulate the problem in terms of probabilistic inference within a topic model, where each topic is associated with a Wikipedia article. To deal with the large number of topics, we propose a novel, efficient Gibbs sampling scheme which can also incorporate side information, such as the Wikipedia graph. This conceptually simple probabilistic approach achieves state-of-the-art entity-linking performance on the AIDA-CoNLL dataset.
Deep Learning for Information Retrieval: Models, Progress, & Opportunities - Matthew Lease
Talk given at the 8th Forum for Information Retrieval Evaluation (FIRE, http://fire.irsi.res.in/fire/2016/), December 10, 2016, and at the Qatar Computing Research Institute (QCRI), December 15, 2016.
Semantic Web languages: Expressivity vs scalability - nvitucci
- Semantic Web and huge data sources are becoming more and more popular
- Reasoning should scale well, but the whole point of description logics (DLs) is to be expressive
- Different approaches to representation and to reasoning are needed
- Research is moving towards scalable reasoning for expressive logics
The growth of the Web and social networks, and the mass digitization of documents, are contributing to a renewal of the humanities and social sciences, of the study of literary and cultural heritage, and of how the scientific literature in general is exploited.
The digital humanities, which cross various disciplines with computer science, place at their center questions about the volume of data, its diversity, its origin, its veracity and its representativeness. Information is conveyed in textual "documents" (books, Web pages, tweets...), audio, video or multimedia, which may include illustrations or graphics.
Working with such resources requires the development of robust computational approaches that scale and that suit the fundamentally ambiguous and varied nature of the information being handled (natural language or images to interpret, multiple points of view...).
While statistical machine learning approaches are commonplace for classification or information extraction tasks, they must cope with sparse vector spaces of very high dimensionality (several million dimensions), be able to exploit resources (for example lexicons or thesauri), and take into account or produce semantic annotations that can be reused.
To address these challenges, infrastructures have been created such as HumaNum at the national level and DARIAH or CLARIN at the European level, and recommendations have been established at the global level such as the TEI (Text Encoding Initiative). Platforms serving scientific information, such as the "equipment of excellence" OpenEdition.org, are another essential building block for preserving and accessing the "Big Digital Humanities", and also for fostering the reproducibility and understanding of experiments and their results.
0x01 - Newton's Third Law: Static vs. Dynamic Abusers - OWASP Beja
If you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it to their own ends. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation by Morris Kleiner (University of Minnesota) was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Acorn Recovery: Restore IT infra within minutes - IP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) offering by IP ServerOne: a DR solution that helps restore your IT infrastructure within minutes.
This presentation, created by Syed Faiz ul Hassan, explores the profound influence of media on public perception and behavior. It delves into the evolution of media from oral traditions to modern digital and social media platforms. Key topics include the role of media in information propagation, socialization, crisis awareness, globalization, and education. The presentation also examines media influence through agenda setting, propaganda, and manipulative techniques used by advertisers and marketers. Furthermore, it highlights the impact of surveillance enabled by media technologies on personal behavior and preferences. Through this comprehensive overview, the presentation aims to shed light on how media shapes collective consciousness and public opinion.
Announcement of 18th IEEE International Conference on Software Testing, Verif...
LDAvis
1. LDAvis: A method for visualizing and interpreting topics
Carson Sievert, Iowa State University
Kenneth E. Shirley, AT&T Labs Research
ILLVI 2014
2. • LDAvis is a web-based interactive visualization of topics estimated using Latent Dirichlet Allocation, built using a combination of R and D3.
• We introduce LDAvis, which attempts to answer a few basic questions about a fitted topic model:
I. INTRODUCTION
3. A. What is the meaning of each topic?
B. How prevalent is each topic?
C. How do the topics relate to each other?
I. INTRODUCTION
4. II. RELATED WORK
LDA, Turbo Topics, lift, PMI
5. II. RELATED WORK
LDA
Latent Dirichlet allocation (LDA) is an example of a topic model and was first presented as a graphical model for topic discovery by David Blei, Andrew Ng, and Michael I. Jordan in 2003.
6. II. RELATED WORK
Turbo Topics
Blei and Lafferty (2009) developed "Turbo Topics", a method of identifying n-grams within LDA-inferred topics. The resulting output is still simply a ranked list containing a mixture of terms and n-grams.
8. II. RELATED WORK
PMI
Pointwise Mutual Information (PMI): PMI(x, y) = log [ P(x, y) / (P(x) P(y)) ]
Under independence, P(x, y) = P(x) × P(y), so PMI = log 1 = 0.
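The independence identity on this slide can be checked numerically; a minimal sketch with made-up probabilities:

```python
import math

def pmi(p_xy, p_x, p_y):
    """Pointwise mutual information: log of observed vs. expected co-occurrence."""
    return math.log(p_xy / (p_x * p_y))

p_x, p_y = 0.2, 0.5
# Independent events: the joint probability factorizes, so PMI = log 1 = 0.
print(pmi(p_x * p_y, p_x, p_y))
# Positive association: the pair co-occurs more often than chance, so PMI > 0.
print(pmi(0.15, p_x, p_y) > 0)
```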
9. II. RELATED WORK
Lift
Let φ_kw denote the probability of term w ∈ {1, ..., V} for topic k ∈ {1, ..., K}, where V denotes the number of terms in the vocabulary, and let p_w denote the marginal probability of term w in the corpus. Lift is then defined as the ratio φ_kw / p_w.
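Lift compares a term's within-topic probability to its corpus-wide probability; values above 1 mean the term is over-represented in the topic. A toy sketch with made-up probabilities:

```python
import numpy as np

# phi[k, w]: probability of term w in topic k (each row sums to 1); toy values.
phi = np.array([[0.6, 0.3, 0.1],
                [0.1, 0.2, 0.7]])
# p[w]: marginal probability of term w in the corpus.
p = np.array([0.35, 0.25, 0.40])

lift = phi / p      # broadcasting divides each topic row element-wise by p
print(lift[0])      # term 0 is over-represented in topic 0 (lift > 1)
```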
10. II. TOPIC MODEL VISUALIZATION
• A number of visualization systems for topic models have been developed in recent years.
• But the visualization elements are limited to bar charts or word clouds of term probabilities for each topic, pie charts of topic probabilities for each document, and/or various bar charts or scatterplots related to document metadata.
11. II. TOPIC MODEL VISUALIZATION
• Chuang et al. (2012b) developed such a tool, called "Termite", which visualizes the set of topic-term distributions estimated in LDA using a matrix layout.
• http://vis.stanford.edu/papers/termite
12. II. TOPIC MODEL VISUALIZATION
• To select the keywords that represent a topic's content, Termite defines a saliency measure: saliency(w) = P(w) × distinctiveness(w).
• Here w is a keyword, P(w) is the frequency of w, and distinctiveness(w) captures how much w differs across topics: the more topics that contain w, the lower distinctiveness(w).
• For keyword ordering, Termite places words that frequently occur consecutively, such as "social" and "networks", next to each other to reinforce their semantics and make them easier for users to understand.
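Termite's saliency can be sketched using the definition from Chuang et al. (2012), in which distinctiveness(w) is the KL divergence of the topic distribution given w from the overall topic distribution; the numbers below are made up:

```python
import numpy as np

def distinctiveness(p_topic_given_w, p_topic):
    """KL divergence of P(topic | w) from the overall topic distribution P(topic)."""
    mask = p_topic_given_w > 0
    return float(np.sum(p_topic_given_w[mask] *
                        np.log(p_topic_given_w[mask] / p_topic[mask])))

def saliency(p_w, p_topic_given_w, p_topic):
    return p_w * distinctiveness(p_topic_given_w, p_topic)

p_topic = np.array([0.5, 0.5])
# A term concentrated in one topic is more distinctive ...
focused = saliency(0.01, np.array([0.95, 0.05]), p_topic)
# ... than an equally frequent term spread evenly across topics.
spread = saliency(0.01, np.array([0.5, 0.5]), p_topic)
print(focused > spread)
```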
13. III. RELEVANCE OF TERMS TO TOPICS
• Here we define relevance, our method for ranking terms within topics, and we describe the results of a user study to learn an optimal tuning parameter for computing relevance.
• We define the relevance of term w to topic k given a weight parameter λ (where 0 ≤ λ ≤ 1).
14. III. RELEVANCE OF TERMS TO TOPICS
• λ determines the weight given to the probability of term w under topic k relative to its lift.
• Setting λ = 1 results in the familiar ranking of terms in decreasing order of their topic-specific probability, and setting λ = 0 ranks terms solely by their lift.
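The relevance measure itself, relevance(w, k | λ) = λ·log(φ_kw) + (1 − λ)·log(φ_kw / p_w), interpolates between these two extremes. A minimal sketch with toy probabilities (not tied to any real corpus):

```python
import numpy as np

def relevance(phi_k, p, lam):
    """relevance(w, k | lam) = lam*log(phi_kw) + (1 - lam)*log(phi_kw / p_w)."""
    return lam * np.log(phi_k) + (1 - lam) * np.log(phi_k / p)

# Toy topic-term probabilities and corpus marginals for a 3-term vocabulary.
phi_k = np.array([0.5, 0.3, 0.2])   # term probabilities within topic k
p = np.array([0.6, 0.1, 0.3])       # corpus-wide term probabilities

print(np.argsort(-relevance(phi_k, p, 1.0)))  # lambda = 1: rank by phi alone
print(np.argsort(-relevance(phi_k, p, 0.0)))  # lambda = 0: rank by lift alone
```

At λ = 1 the common term 0 leads; at λ = 0 term 1, which is rare in the corpus but frequent in the topic, jumps to the top.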
15. III. RELEVANCE OF TERMS TO TOPICS
• We fit a 50-topic model to the 20 Newsgroups data and computed the relevance to a given topic of each term in the vocabulary (which has size V = 22,524).
• The figure shows this plot for Topic 29, whose documents were posted mostly to the "Motorcycles" Newsgroup, but also to the "Automobiles" and "Electronics" Newsgroups.
17. III. USER STUDY
13,695 documents from the 20 Newsgroups corpus
18. III. USER STUDY
• Some of the LDA-inferred topics occurred almost exclusively (> 90% of occurrences) in a single Newsgroup, such as Topic 38, which came from documents posted to the "Medicine" ("sci.med") Newsgroup.
• Other topics occurred in a wide variety of Newsgroups. One would expect these "spread-out" topics to be harder to interpret than "pure" topics like Topic 38.
19. III. USER STUDY
• We recruited 29 subjects among our colleagues (research scientists at AT&T Labs with moderate familiarity with text mining techniques and topic models).
• Each subject completed an online experiment consisting of 50 tasks, one for each topic k (k ∈ {1, ..., 50}).
• Task k was to read a list of five terms, ranked 1-5 in order of relevance to topic k, where λ ∈ (0, 1) was randomly sampled to compute the relevance.
20. III. USER STUDY
• We expected the proportion of correct responses to be roughly 1/3, no matter the value of λ used to compute relevance.
• In fact, seven of the topics were correctly identified by all 29 users, and one topic was incorrectly identified by all users.
• We estimated a topic-specific intercept term to control for the difficulty of the topic (not just its token variety, but also the inherent familiarity of each topic to our subjects).
21. III. USER STUDY
• The estimated effects of λ and λ² were statistically significant (χ² p-value = 0.018).
• There was roughly a 67% baseline probability of correct identification. As Figure 3 shows, for these topics, the "optimal" value of λ was about 0.6.
• λ = 0 yielded roughly 53% correct identification, and λ = 1 roughly 63%.
• We view this as evidence that setting λ < 1 can really improve topic interpretability.
23. IV. THE LDAVIS SYSTEM
Installation: pip install pyldavis
24. IV. THE LDAVIS SYSTEM
http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
25. IV. THE LDAVIS SYSTEM
• Processing text with gensim:
http://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/Gensim%20Newsgroup.ipynb
• LDAvis chart for the Free China (自由中國) corpus:
http://nbviewer.jupyter.org/github/effytseng/dh_ldavis/blob/master/LDA-practice-1.html#topic=0&lambda=1&term=
• l-ldavis chart for the Free China (自由中國) corpus:
http://nbviewer.jupyter.org/github/effytseng/dh_ldavis/blob/master/python-lldavis_test1-24-2.html
26. IV. THE LDAVIS SYSTEM
• The areas of the circles are proportional to the relative prevalences of the topics in the corpus.
• The default measure for computing inter-topic distances is Jensen-Shannon divergence.
• By default, the set of inter-topic distances is scaled to two dimensions using principal components.
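These defaults can be reproduced in outline: compute pairwise Jensen-Shannon divergences between the topic-term distributions, then scale the distances to two dimensions. The sketch below uses a plain PCA projection of the centered distance matrix as a simple stand-in for the scaling pyLDAvis performs, with made-up topic distributions:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Three toy topic-term distributions over a 4-term vocabulary.
topics = np.array([[0.70, 0.10, 0.1, 0.1],
                   [0.10, 0.70, 0.1, 0.1],
                   [0.65, 0.15, 0.1, 0.1]])

K = len(topics)
dist = np.array([[js_divergence(topics[i], topics[j]) for j in range(K)]
                 for i in range(K)])

# Project to 2-D with PCA on the centered distance matrix (a simple stand-in
# for the principal-components scaling the slide describes).
centered = dist - dist.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
coords = centered @ vt[:2].T
print(coords.shape)   # one 2-D point per topic
```

Topics 0 and 2 are nearly identical distributions, so their pairwise divergence is much smaller than their divergence from topic 1, and they land close together in the 2-D map.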
30. V. DISCUSSION
A. What is the meaning of each topic?
B. How prevalent is each topic?
C. How do the topics relate to each other?
31. V. DISCUSSION
• For future work, we anticipate performing a larger user study to further understand how to facilitate topic interpretation in fitted LDA models, beyond relevance alone.
• In addition, visualizing correlations between topics can provide insight into what is happening at the document level without actually displaying entire documents.
• Last, we seek a solution to the problem of visualizing a large number of topics (say, 100-500) in a compact way.