Strategy 3 for topical phrase mining first performs phrase mining on a corpus to extract candidate phrases, then applies topic modeling with the phrases as constraints. This approach generates coherent topics in which all words within a phrase share a topic label. Strategy 3 produces high-quality topics and phrases faster than Strategies 1 and 2, with better topic-coherence, phrase-quality, and phrase-intrusion scores. However, Strategy 3's topic inference relies on the accuracy of the initial phrase mining.
2. Phrase Mining and Topic Modeling from Large Corpora
JIAWEI HAN
COMPUTER SCIENCE
UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
JUNE 16, 2016
3. Why Phrase Mining?
Unigrams vs. phrases
Unigrams (single words) are ambiguous
Example: “United”: United States? United Airline? United Parcel Service?
Phrase: A natural, meaningful, unambiguous semantic unit
Example: “United States” vs. “United Airline”
Mining semantically meaningful phrases
Transform text data from word granularity to phrase granularity
Enhance the power and efficiency of manipulating unstructured data using database technology
4. Mining Phrases: Why Not Use NLP Methods?
Phrase mining: Originated from the NLP community—“Chunking”
Model it as a sequence labeling problem (B-NP, I-NP, O, …)
Need annotation and training
Annotate hundreds of documents as training data
Train a supervised model based on part-of-speech features
Recent trend:
Use distributional features based on web n-grams (Bergsma et al., 2010)
State-of-the-art performance: ~95% accuracy, ~88% phrase-level F-score
Limitations
High annotation cost; not scalable to a new language, domain, or genre
May not fit domain-specific, dynamic, emerging applications
Scientific domains, query logs, or social media, e.g., Yelp, Twitter
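To make the chunking formulation above concrete, here is a minimal sketch of pattern-based noun-phrase extraction over POS-tagged tokens. The tag pattern (optional adjectives followed by one or more nouns) and the example tags are illustrative simplifications, not the supervised B-NP/I-NP labeler the slide describes:

```python
def np_chunks(tagged):
    """Extract noun-phrase chunks matching the pattern JJ* NN+ from a list
    of (word, POS-tag) pairs. A toy stand-in for a trained chunker."""
    chunks = []
    i = 0
    while i < len(tagged):
        j = i
        while j < len(tagged) and tagged[j][1] == "JJ":   # leading adjectives
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):  # nouns
            k += 1
        if k > j:  # at least one noun: emit the whole JJ* NN+ span
            chunks.append(" ".join(w for w, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return chunks

tagged = [("the", "DT"), ("united", "JJ"), ("states", "NNS"),
          ("supports", "VBZ"), ("support", "NN"), ("vector", "NN"),
          ("machines", "NNS")]
print(np_chunks(tagged))  # ['united states', 'support vector machines']
```

A real chunker learns such boundaries from annotated data; this hand-written pattern is exactly the kind of brittle rule that motivates the data-driven approaches on the following slides.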
5. Data Mining Approaches
General principle: Fully exploit information redundancy and data-driven criteria to determine phrase boundaries and salience
Phrase Mining and Topic Modeling from Large Corpora
Strategy 1: Simultaneously Inferring Phrases and Topics
Strategy 2: Post Topic Modeling Phrase Construction
Strategy 3: First Phrase Mining then Topic Modeling (ToPMine)
Integration of Phrase Mining with Document Segmentation
7. Frequent Pattern Mining for Text Data: Phrase Mining and Topic Modeling
Motivation: Unigrams (single words) can be difficult to interpret
Ex.: The topic that represents the area of Machine Learning
Unigram topic: learning, reinforcement, support, machine, vector, selection, feature, random, …
versus
Phrase-based topic: learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
8. Various Strategies: Phrase-Based Topic Modeling
Strategy 1: Generate bag-of-words → generate sequence of tokens
Bigram topical model [Wallach’06], topical n-gram model [Wang, et al.’07], phrase discovering topic model [Lindsey, et al.’12]
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
Label topic [Mei et al.’07], TurboTopic [Blei & Lafferty’09], KERT [Danilevsky, et al.’14]
Strategy 3: Prior bag-of-words model inference, mine phrases and impose on the bag-of-words model
ToPMine [El-Kishky, et al.’15]
10. Strategy 1: Simultaneously Inferring Phrases and Topics
Bigram Topic Model [Wallach’06]
Probabilistic generative model that conditions on the previous word and topic when drawing the next word
Topical N-Grams (TNG) [Wang, et al.’07]
Probabilistic model that generates words in textual order
Creates n-grams by concatenating successive bigrams (a generalization of the Bigram Topic Model)
Phrase-Discovering LDA (PDLDA) [Lindsey, et al.’12]
Viewing each sentence as a time series of words, PDLDA posits that the generative parameter (topic) changes periodically
Each word is drawn based on the previous m words (context) and the current phrase topic
High model complexity: tends to overfit; high inference cost: slow
14. Strategy 2: Post Topic Modeling Phrase Construction
TurboTopics [Blei & Lafferty’09] – Phrase construction as a post-processing step to Latent Dirichlet Allocation
Perform Latent Dirichlet Allocation on the corpus to assign each token a topic label
Merge adjacent unigrams with the same topic label by a distribution-free permutation test on an arbitrary-length back-off model
End recursive merging when all significant adjacent unigrams have been merged
KERT [Danilevsky et al.’14] – Phrase construction as a post-processing step to Latent Dirichlet Allocation
Perform frequent pattern mining on each topic
Perform phrase ranking based on four different criteria
15. Example of TurboTopics
Perform LDA on the corpus to assign each token a topic label
E.g., … phase_11 transition_11 … game_153 theory_127 … (each token subscripted with its topic label)
Then merge adjacent unigrams with the same topic label
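The merging step above can be sketched in a few lines. Note this simplified version merges every same-topic run unconditionally; TurboTopics additionally requires each merge to pass a permutation significance test, which is omitted here:

```python
def merge_same_topic(labeled_tokens):
    """Merge runs of adjacent tokens that share a topic label into phrases.
    Input: list of (word, topic_label) pairs from LDA token assignments.
    (The permutation significance test of TurboTopics is omitted.)"""
    merged = []
    for word, topic in labeled_tokens:
        if merged and merged[-1][1] == topic:
            # extend the current same-topic phrase
            merged[-1] = (merged[-1][0] + " " + word, topic)
        else:
            merged.append((word, topic))
    return merged

tokens = [("phase", 11), ("transition", 11), ("occurs", 0),
          ("game", 153), ("theory", 127)]
print(merge_same_topic(tokens))
# [('phase transition', 11), ('occurs', 0), ('game', 153), ('theory', 127)]
```

As the example shows, “phase transition” is recovered because both tokens carry topic 11, while “game theory” is not, because its tokens were assigned different topics.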
16. KERT: Topical Keyphrase Extraction & Ranking [Danilevsky, et al. 2014]
Input (paper titles, with unigram topic assignment: Topic 1 & Topic 2):
knowledge discovery using least squares support vector machine classifiers
support vectors for reinforcement learning
a hybrid approach to feature selection
pseudo conditional random fields
automatic web page classification in a dynamic and hierarchical way
inverse time dependency in convex regularized learning
postprocessing decision trees to extract actionable knowledge
variance minimization least squares support vector machines
…
Output of topical keyphrase extraction & ranking:
learning, support vector machines, reinforcement learning, feature selection, conditional random fields, classification, decision trees, …
17. Framework of KERT
1. Run bag-of-words model inference and assign a topic label to each token
2. Extract candidate keyphrases within each topic (frequent pattern mining)
3. Rank the keyphrases in each topic
Popularity: ‘information retrieval’ vs. ‘cross-language information retrieval’
Discriminativeness: only frequent in documents about topic t
Concordance: ‘active learning’ vs. ‘learning classification’
Completeness: ‘vector machine’ vs. ‘support vector machine’
Comparability property: directly compare phrases of mixed lengths
18. Comparison of Phrase Ranking Methods
The topic that represents the area of Machine Learning
KERT: Topical Phrases on Machine Learning (top-ranked phrases by mining paper titles in DBLP)

kpRel [Zhao et al. 11] | KERT (-popularity) | KERT (-discriminativeness) | KERT (-concordance) | KERT [Danilevsky et al. 14]
learning | effective | support vector machines | learning | learning
classification | text | feature selection | classification | support vector machines
selection | probabilistic | reinforcement learning | selection | reinforcement learning
models | identification | conditional random fields | feature | feature selection
algorithm | mapping | constraint satisfaction | decision | conditional random fields
features | task | decision trees | bayesian | classification
decision | planning | dimensionality reduction | trees | decision trees
… | … | … | … | …
20. Strategy 3: First Phrase Mining then Topic Modeling
ToPMine [El-Kishky et al. VLDB’15]
First phrase construction, then topic mining
Contrast with KERT: topic modeling, then phrase mining
The ToPMine Framework:
Perform frequent contiguous pattern mining to extract candidate phrases and their counts
Perform agglomerative merging of adjacent unigrams as guided by a significance score—this segments each document into a “bag-of-phrases”
The newly formed bag-of-phrases is passed as input to PhraseLDA, an extension of LDA that constrains all words in a phrase to share the same latent topic
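The first step of the framework, frequent contiguous pattern mining, can be sketched as a simple n-gram counter with a minimum-support threshold. The real implementation prunes candidates Apriori-style using downward closure rather than enumerating all n-grams; this brute-force version is only illustrative:

```python
from collections import Counter

def frequent_contiguous_phrases(docs, min_support=2, max_len=4):
    """Count contiguous n-grams (n >= 2) across tokenized documents and
    keep those whose count reaches `min_support`. Brute-force sketch of
    ToPMine's candidate-phrase mining step (no Apriori pruning)."""
    counts = Counter()
    for tokens in docs:
        for n in range(2, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {phrase: c for phrase, c in counts.items() if c >= min_support}

docs = [["support", "vector", "machine"],
        ["support", "vector", "machine"],
        ["support", "vector"]]
print(frequent_contiguous_phrases(docs, min_support=2))
```

The resulting phrase-to-count map (raw frequencies) feeds the significance-guided agglomerative merging described on slide 22.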
21. Why First Phrase Mining then Topic Modeling?
With Strategy 2, tokens in the same phrase may be assigned to different topics
Ex.: knowledge discovery using least squares support vector machine classifiers …
“Knowledge discovery” and “support vector machine” should have coherent topic labels
Solution: switch the order of phrase mining and topic model inference
Techniques
Phrase mining and document segmentation
Topic model inference with phrase constraints
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
22. Phrase Mining: Frequent Pattern Mining + Statistical Analysis
[Markov blanket] [feature selection] for [support vector machines]
[knowledge discovery] using [least squares] [support vector machine] [classifiers]
… [support vector] for [machine learning] …

Quality phrases:
Phrase | Raw freq. | True freq.
[support vector machine] | 90 | 80
[vector machine] | 95 | 0
[support vector] | 100 | 20

Based on the significance score [Church et al.’91]:
α(P1, P2) ≈ (f(P1●P2) − µ0(P1, P2)) / √f(P1●P2)
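The significance score can be computed as below. The slide does not spell out how µ0 is estimated; this sketch assumes an independence null, µ0 = n·p(P1)·p(P2), where n is the number of token positions in the corpus, so treat that choice as an assumption rather than the paper's exact estimator:

```python
import math

def significance(f_merged, f1, f2, n_positions):
    """Significance score for merging phrases P1 and P2:
        alpha(P1, P2) ~ (f(P1.P2) - mu0(P1, P2)) / sqrt(f(P1.P2))
    f_merged: observed count of the concatenation P1.P2
    f1, f2:   counts of P1 and P2 individually
    mu0 is the expected count under independence (an assumed estimator)."""
    mu0 = n_positions * (f1 / n_positions) * (f2 / n_positions)
    return (f_merged - mu0) / math.sqrt(f_merged)

# "support" + "vector" merged 80 times in a 10,000-token corpus
print(significance(80, 100, 90, 10_000))
```

A large positive score means the merged phrase occurs far more often than chance predicts, so the agglomerative segmenter keeps merging; merging stops once no adjacent pair exceeds a significance threshold.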
23. Collocation Mining
Collocation: A sequence of words that occur together more frequently than expected
Often “interesting” and, due to their non-compositionality, often relay information not portrayed by their constituent terms (e.g., “made an exception”, “strong tea”)
Many different measures are used to extract collocations from a corpus [Dunning 93, Pedersen 96]
E.g., mutual information, t-test, z-test, chi-squared test, likelihood ratio
Many of these measures can be used to guide the agglomerative phrase-segmentation algorithm
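As one example of the measures listed above, pointwise mutual information (PMI) for an adjacent word pair can be estimated directly from corpus counts (a minimal sketch, without the smoothing a practical implementation would use):

```python
import math
from collections import Counter

def pmi(tokens, w1, w2):
    """Pointwise mutual information of the adjacent pair (w1, w2):
        PMI = log( p(w1, w2) / (p(w1) * p(w2)) )
    estimated from unigram and adjacent-bigram counts in `tokens`."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p12 = bigrams[(w1, w2)] / (n - 1)      # adjacent-pair probability
    p1, p2 = unigrams[w1] / n, unigrams[w2] / n
    return math.log(p12 / (p1 * p2))

tokens = ["strong", "tea", "is", "strong", "tea"]
print(pmi(tokens, "strong", "tea"))  # positive: the pair co-occurs above chance
```

A PMI well above zero signals a collocation; any of the other measures (t-test, likelihood ratio, …) could be dropped into the same agglomerative merging loop.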
24. 24
ToPMine: Phrase LDA (Constrained Topic Modeling)
The generative model for PhraseLDA is the same as LDA
Difference: the model incorporates constraints obtained from the "bag-of-phrases" input
Its chain graph shows that all words in a phrase are constrained to take on the same topic values
[knowledge discovery] using [least squares] [support vector machine] [classifiers] …
Topic model inference with phrase constraints
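Illustrative only: a minimal collapsed Gibbs sampler sketch of the phrase constraint, assuming a bag-of-phrases input. One topic is sampled per phrase, so all of its words share that topic; unlike exact PhraseLDA chain-graph inference, the sketch scores the words of a phrase independently under each candidate topic for brevity.

```python
import random
from collections import defaultdict

def phrase_lda(docs, K, iters=50, alpha=0.1, beta=0.01, seed=0):
    """docs: list of documents, each a list of phrases (tuples of words).
    Returns one sampled topic per phrase per document."""
    rng = random.Random(seed)
    vocab = {w for doc in docs for ph in doc for w in ph}
    V = len(vocab)
    n_dk = [[0] * K for _ in docs]               # word counts per (doc, topic)
    n_kw = [defaultdict(int) for _ in range(K)]  # word counts per topic
    n_k = [0] * K                                # total words per topic
    z = []                                       # one topic per phrase
    for d, doc in enumerate(docs):               # random initialization
        z.append([])
        for ph in doc:
            t = rng.randrange(K)
            z[d].append(t)
            n_dk[d][t] += len(ph)
            n_k[t] += len(ph)
            for w in ph:
                n_kw[t][w] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, ph in enumerate(doc):
                t = z[d][i]                      # remove the phrase's counts
                n_dk[d][t] -= len(ph); n_k[t] -= len(ph)
                for w in ph:
                    n_kw[t][w] -= 1
                weights = []
                for k in range(K):
                    p = n_dk[d][k] + alpha
                    # every word in the phrase must come from topic k
                    for w in ph:
                        p *= (n_kw[k][w] + beta) / (n_k[k] + V * beta)
                    weights.append(p)
                t = rng.choices(range(K), weights)[0]
                z[d][i] = t                      # add the counts back
                n_dk[d][t] += len(ph); n_k[t] += len(ph)
                for w in ph:
                    n_kw[t][w] += 1
    return z

docs = [[("support", "vector", "machine"), ("active", "learning")],
        [("web", "search"), ("search", "engine")]]
print(phrase_lda(docs, K=2))
```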
25. Example Topical Phrases: A Comparison
ToPMine [El-Kishky et al. '14], Strategy 3 (67 seconds):

Topic 1                  Topic 2
information retrieval    feature selection
social networks          machine learning
web search               semi supervised
search engine            large scale
information extraction   support vector machines
question answering       active learning
web pages                face recognition
:                        :

PDLDA [Lindsey et al. '12], Strategy 1 (3.72 hours):

Topic 1                  Topic 2
social networks          information retrieval
web search               text classification
time series              machine learning
search engine            support vector machines
management system        information extraction
real time                neural networks
decision trees           text categorization
:                        :
29. Session 5. A Comparative Study of Three Strategies
30. Efficiency: Running Time of Different Strategies
Strategy 1: Generate bag-of-words → generate sequence of tokens
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
Strategy 3: Prior bag-of-words model inference, mine phrases and impose to the bag-of-words model
Running time: strategy 3 > strategy 2 > strategy 1 (">" means outperforms, i.e., runs faster)
31. Coherence of Topics: Comparison of Strategies
Strategy 1: Generate bag-of-words → generate sequence of tokens
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
Strategy 3: Prior bag-of-words model inference, mine phrases and impose to the bag-of-words model
Coherence measured by z-score: strategy 3 > strategy 2 > strategy 1
32. Phrase Intrusion: Comparison of Strategies
Phrase intrusion measured by average number of correct answers:
strategy 3 > strategy 2 > strategy 1
33. Phrase Quality: Comparison of Strategies
Phrase quality measured by z-score:
strategy 3 > strategy 2 > strategy 1
34. Summary: Strategies on Topical Phrase Mining
Strategy 1: Generate bag-of-words → generate sequence of tokens
Integrated complex model; phrase quality and topic inference rely on each other
Slow and prone to overfitting
Strategy 2: Post bag-of-words model inference, visualize topics with n-grams
Phrase quality relies on topic labels for unigrams
Can be fast; generally high-quality topics and phrases
Strategy 3: Prior bag-of-words model inference, mine phrases and impose them on the bag-of-words model
Topic inference relies on correct segmentation of documents, but is not very sensitive to segmentation errors
Can be fast; generally high-quality topics and phrases
36. Session 6. Mining Quality Phrases from Massive Text Corpora: A SegPhrase Approach
37. Mining Phrases: Why Not Use Raw Frequency-Based Methods?
Traditional data-driven approaches
Frequent pattern mining: if AB is frequent, AB could likely be a phrase
But raw frequency does NOT reflect the quality of phrases
E.g., freq(vector machine) ≥ freq(support vector machine), yet "vector machine" is rarely a meaningful phrase
Need to rectify the frequency based on segmentation results
Phrasal segmentation tells which words should be treated as a whole phrase and which remain unigrams
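For illustration, once a corpus has been segmented, the rectified frequency is just a count over segments rather than raw substrings (the sample documents below are hypothetical):

```python
from collections import Counter

def rectified_frequencies(segmented_docs):
    """Count phrase frequencies from a segmented corpus: an n-gram is counted
    only when the segmenter kept it as a single segment, so "vector machine"
    can get rectified frequency 0 even though it is frequent as a raw substring."""
    freq = Counter()
    for doc in segmented_docs:
        for segment in doc:  # each segment is one phrase or one unigram
            freq[segment] += 1
    return freq

segmented = [
    ["a", "standard", "feature vector", "machine learning", "setup"],
    ["knowledge discovery", "using", "support vector machine", "classifiers"],
]
freq = rectified_frequencies(segmented)
print(freq["machine learning"], freq["vector machine"])  # 1 0
```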
38. SegPhrase: From Raw Corpus to Quality Phrases and Segmented Corpus
Pipeline: Raw Corpus → Phrase Mining → Quality Phrases → Phrasal Segmentation → Segmented Corpus
Example raw corpus:
Document 1: Citation recommendation is an interesting but challenging research problem in the data mining area.
Document 2: In this study, we investigate the problem in the context of heterogeneous information networks using data mining techniques.
Document 3: Principal Component Analysis is a linear dimensionality reduction technique commonly used in machine learning applications.
39. SegPhrase: The Overall Framework
ClassPhrase: Frequent pattern mining, feature extraction, classification
SegPhrase: Phrasal segmentation and phrase quality estimation
SegPhrase+: One more round to enhance mined phrase quality
40. What Kind of Phrases Are of “High Quality”?
Judging the quality of phrases
Popularity
“information retrieval” vs. “cross-language information retrieval”
Concordance
“powerful tea” vs. “strong tea”
“active learning” vs. “learning classification”
Informativeness
“this paper” (frequent but not discriminative, not informative)
Completeness
“vector machine” vs. “support vector machine”
41. ClassPhrase I: Pattern Mining for Candidate Set
Build a candidate phrase set by frequent pattern mining
Mine frequent k-grams; k is typically small, e.g., up to 6 in our experiments
Popularity is measured by the raw frequency of words and phrases mined from the corpus
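A minimal sketch of this candidate-mining step (the toy corpus and min_support value are illustrative; a real implementation would use Apriori-style pruning instead of counting every k-gram):

```python
from collections import Counter

def frequent_kgrams(docs, max_k=6, min_support=2):
    """Mine all k-grams (k <= max_k) whose raw frequency meets min_support.
    This builds the candidate phrase set for ClassPhrase."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        for k in range(1, max_k + 1):
            for i in range(len(tokens) - k + 1):
                counts[tuple(tokens[i:i + k])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_support}

docs = ["support vector machine classifiers",
        "support vector machine learning",
        "machine learning applications"]
cands = frequent_kgrams(docs)
print(cands[("support", "vector", "machine")])  # 2
```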
42. ClassPhrase II: Feature Extraction: Concordance
Partition a phrase into two parts and check whether their co-occurrence is significantly higher than pure chance
Examples: "support vector machine" → "support vector" + "machine"; "this paper demonstrates" → "this paper" + "demonstrates"
Pointwise mutual information: PMI(⟨ul, ur⟩) = log( p(v) / (p(ul) p(ur)) )
Pointwise KL divergence: PKL(v ∥ ⟨ul, ur⟩) = p(v) log( p(v) / (p(ul) p(ur)) )
(v is the phrase; ul and ur are its left and right parts)
The additional factor p(v) multiplying the pointwise mutual information leads to less bias towards rarely occurring phrases
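These two features can be sketched directly from the probabilities (the probabilities below are made up to show why PKL is preferred):

```python
import math

def pmi(p_v, p_left, p_right):
    """Pointwise mutual information of a phrase v and one two-part split."""
    return math.log(p_v / (p_left * p_right))

def pkl(p_v, p_left, p_right):
    """Pointwise KL divergence: PMI weighted by p(v), which damps the
    score of rarely occurring phrases."""
    return p_v * pmi(p_v, p_left, p_right)

# A frequent collocation and an equally "surprising" but very rare phrase:
# identical PMI, but PKL prefers the frequent one.
print(pmi(1e-4, 1e-3, 1e-3), pmi(1e-7, 1e-4, 1e-5))
print(pkl(1e-4, 1e-3, 1e-3), pkl(1e-7, 1e-4, 1e-5))
```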
43. ClassPhrase II: Feature Extraction: Informativeness
Deriving informativeness
Quality phrases typically start and end with non-stopwords
"machine learning is" vs. "machine learning"
Use the average IDF over the words in a phrase to measure its informativeness
Usually, the probability that a quality phrase appears in quotes, brackets, or connected by dashes should be higher (punctuation information)
"state-of-the-art"
We can also incorporate features from NLP techniques, such as POS tagging, chunking, and semantic parsing
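A small sketch of two such features (the stopword list and IDF values below are illustrative placeholders, not the paper's):

```python
STOPWORDS = {"is", "the", "this", "of", "a"}  # illustrative stopword list

def informativeness_features(phrase, idf):
    """Two simple informativeness features for a candidate phrase:
    whether it starts and ends with non-stopwords, and its average IDF."""
    words = phrase.split()
    return {
        "starts_ends_non_stopword": (words[0] not in STOPWORDS
                                     and words[-1] not in STOPWORDS),
        "avg_idf": sum(idf.get(w, 0.0) for w in words) / len(words),
    }

# hypothetical IDF values
idf = {"machine": 3.2, "learning": 3.0, "is": 0.1, "this": 0.2, "paper": 1.5}
print(informativeness_features("machine learning", idf))
print(informativeness_features("machine learning is", idf))
```

The trailing stopword in "machine learning is" both fails the boundary check and drags down the average IDF, so the classifier sees it as less informative.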
44. ClassPhrase III: Classifier
Limited training
Labels: whether a phrase is a quality one or not
"support vector machine": 1
"the experiment shows": 0
For a ~1GB corpus, only 300 labels
Random forest as our classifier
Predicted phrase quality scores lie in [0, 1]
Bootstrap many different datasets from the limited labels
45. SegPhrase: Why Do We Need Phrasal Segmentation in the Corpus?
Phrasal segmentation can tell which phrase is more appropriate
Ex.: A standard ⌈feature vector⌋ ⌈machine learning⌋ setup is used to describe...
Here "vector machine" spans two segments, so this occurrence is not counted towards its rectified phrase frequency (expected influence)
46. SegPhrase: Segmentation of Phrases
Partition a sequence of words by maximizing the likelihood, considering:
Phrase quality score: ClassPhrase assigns a quality score to each phrase
Probability in the corpus
Length penalty α: when α > 1, it favors shorter phrases
Filter out phrases with low rectified frequency
Bad phrases are expected to rarely occur in the segmentation results
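A rough dynamic-programming sketch of such a segmenter. The quality scores, unigram fallback, and the exact penalty form below are illustrative stand-ins, not the paper's objective:

```python
import math

def segment(tokens, quality, alpha=1.2, max_len=6):
    """Pick the segmentation maximizing the sum of per-segment log-scores:
    log(quality) minus a length penalty (alpha > 1 favors shorter segments)."""
    n = len(tokens)
    best = [float("-inf")] * (n + 1)  # best score of a segmentation of tokens[:i]
    best[0] = 0.0
    back = [0] * (n + 1)              # backpointer: start index of the last segment
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            seg = " ".join(tokens[j:i])
            # unknown multi-word segments get a tiny score; unigrams a fallback
            q = quality.get(seg, 1e-6 if i - j > 1 else 0.1)
            score = best[j] + math.log(q) - alpha * (i - j)
            if score > best[i]:
                best[i], back[i] = score, j
    segs, i = [], n                   # reconstruct the best segmentation
    while i > 0:
        segs.append(" ".join(tokens[back[i]:i]))
        i = back[i]
    return segs[::-1]

quality = {"support vector machine": 0.9, "vector machine": 0.3,
           "knowledge discovery": 0.8, "least squares": 0.8}
print(segment("knowledge discovery using least squares support vector machine".split(),
              quality))
# ['knowledge discovery', 'using', 'least squares', 'support vector machine']
```

Because "support vector machine" has a much higher quality score than "vector machine", the whole trigram wins even though the penalty favors shorter segments.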
47. SegPhrase+: Enhancing Phrasal Segmentation
SegPhrase+: one more round for enhanced phrasal segmentation
Feedback: using the rectified frequencies, re-compute the features previously computed from raw frequencies
Process
Classification → Phrasal segmentation // SegPhrase
Classification → Phrasal segmentation → Classification → Phrasal segmentation // SegPhrase+
Effects on computed quality scores, e.g.:
"np hard in the strong sense" vs. "np hard in the strong"
"data base management system"
48. Performance Study: Methods to Be Compared
Phrase mining methods compared:
NLP chunking-based methods
Chunks as candidates
Sorted by TF-IDF and C-value (K. Frantzi et al., 2000)
Unsupervised raw-frequency-based methods
ConExtr (A. Parameswaran et al., VLDB 2010)
ToPMine (A. El-Kishky et al., VLDB 2015)
Supervised method
KEA, designed for single-document keyphrases (O. Medelyan & I. H. Witten, 2006)
49. Performance Study: Experimental Setting
Datasets:

Dataset   #docs    #words    #labels
DBLP      2.77M    91.6M     300
Yelp      4.75M    145.1M    300

Popular Wiki phrases
Based on internal links
~7K high-quality phrases
Pooling
Sampled 500 × 7 Wiki-uncovered phrases
Evaluated by 3 reviewers independently
50. Performance: Precision-Recall Curves on DBLP
[Figure: precision-recall curves on DBLP]
Compared with other baselines: TF-IDF, C-Value, ConExtr, KEA, ToPMine, SegPhrase+
Compared with our three variations: TF-IDF, ClassPhrase, SegPhrase, SegPhrase+
52. Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGMOD)
Query: SIGMOD

Rank  SegPhrase+              Chunking (TF-IDF & C-Value)
1     data base               data base
2     database system         database system
3     relational database     query processing
4     query optimization      query optimization
5     query processing        relational database
…     …                       …
51    sql server              database technology
52    relational data         database server
53    data structure          large volume
54    join query              performance study
55    web service             web service
…     …                       …
201   high dimensional data   efficient implementation
202   location based service  sensor network
203   xml schema              large collection
204   two phase locking       important issue
205   deep web                frequent itemset
…     …                       …

(Highlighted on the slide: phrases found only in SegPhrase+ vs. only in Chunking)
53. Experimental Results: Interesting Phrases Generated (From the Titles and Abstracts of SIGKDD)
Query: SIGKDD

Rank  SegPhrase+               Chunking (TF-IDF & C-Value)
1     data mining              data mining
2     data set                 association rule
3     association rule         knowledge discovery
4     knowledge discovery      frequent itemset
5     time series              decision tree
…     …                        …
51    association rule mining  search space
52    rule set                 domain knowledge
53    concept drift            important problem
54    knowledge acquisition    concurrency control
55    gene expression data     conceptual graph
…     …                        …
201   web content              optimal solution
202   frequent subgraph        semantic relationship
203   intrusion detection      effective way
204   categorical attribute    space complexity
205   user preference          small set
…     …                        …

(Highlighted on the slide: phrases found only in SegPhrase+ vs. only in Chunking)
54. Experimental Results: Similarity Search
In response to a user's phrase query, SegPhrase+ finds high-quality, semantically similar phrases
In DBLP, queried on "data mining" and "OLAP"
In Yelp, queried on "blu-ray", "noodle", and "valet parking"
56. Recent Progress
Distant training: no need for human labeling
Training using general knowledge bases, e.g., Freebase, Wikipedia
Quality estimation for unigrams
Integration of phrases and unigrams in one uniform framework
Demo based on DBLP abstracts
Multiple languages: beyond English corpora
Extensible to mining quality phrases in multiple languages
Recent progress: SegPhrase+ works on Chinese and Arabic
57. Mining Quality Phrases in Multiple Languages
Both ToPMine and SegPhrase+ are extensible to mining quality phrases in multiple languages

SegPhrase+ on Chinese (from Chinese Wikipedia):

Rank   Phrase                         In English
…      …                              …
62     首席_执行官                     CEO
63     中间_偏右                       Center-right
…      …                              …
84     百度_百科                       Baidu Baike
85     热带_气旋                       Tropical cyclone
86     中国科学院_院士                 Fellow of the Chinese Academy of Sciences
…      …                              …
1001   十大_中文_金曲                  Top-10 Chinese Songs
1002   全球_资讯网                     Global News Website
1003   天一阁_藏_明代_科举_录_选刊     (a Chinese book title)
…      …                              …
9934   国家_戏剧_院                    National Theater
9935   谢谢_你                         Thank you
…      …                              …

ToPMine on Arabic (from the Quran, Fus7a Arabic, no preprocessing). Example Arabic phrases:
كفروا : Those who disbelieve
بسم الله الرحمن الرحيم : In the name of God, the Gracious and Merciful
58. Summary on SegPhrase+ and Future Work
SegPhrase+: a new phrase mining framework
Integrates phrase mining with phrasal segmentation
Requires only limited training or distant training
Generates high-quality phrases, close to human judgment
Linearly scalable in time and space
Looking forward: high-quality, scalable phrase mining
Facilitate entity recognition and typing in large corpora (see the next part of this tutorial)
Transform massive unstructured data into semi-structured knowledge networks
60. References
D. M. Blei and J. D. Lafferty, "Visualizing Topics with Multi-Word Expressions," arXiv:0907.1013, 2009
K. Church, W. Gale, P. Hanks, and D. Hindle, "Using Statistics in Lexical Analysis," in U. Zernik (ed.), Lexical Acquisition: Exploiting On-Line Resources to Build a Lexicon, Lawrence Erlbaum, 1991
M. Danilevsky, C. Wang, N. Desai, X. Ren, J. Guo, and J. Han, "Automatic Construction and Ranking of Topical Keyphrases on Collections of Short Documents," SDM'14
A. El-Kishky, Y. Song, C. Wang, C. R. Voss, and J. Han, "Scalable Topical Phrase Mining from Text Corpora," VLDB'15
K. Frantzi, S. Ananiadou, and H. Mima, "Automatic Recognition of Multi-Word Terms: the C-value/NC-value Method," Int. Journal on Digital Libraries, 3(2), 2000
R. V. Lindsey, W. P. Headden III, and M. J. Stipicevic, "A Phrase-Discovering Topic Model Using Hierarchical Pitman-Yor Processes," EMNLP-CoNLL'12
J. Liu, J. Shang, C. Wang, X. Ren, and J. Han, "Mining Quality Phrases from Massive Text Corpora," SIGMOD'15
O. Medelyan and I. H. Witten, "Thesaurus Based Automatic Keyphrase Indexing," IJCDL'06
Q. Mei, X. Shen, and C. Zhai, "Automatic Labeling of Multinomial Topic Models," KDD'07
A. Parameswaran, H. Garcia-Molina, and A. Rajaraman, "Towards the Web of Concepts: Extracting Concepts from Large Datasets," VLDB'10
X. Wang, A. McCallum, and X. Wei, "Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval," ICDM'07
Editor's Notes
An interesting comparison
A state of the art phrase-discovering topic model
A picky audience member might then ask how to project the phrase mining results back onto the corpus to obtain mention-level chunking. During the presentation we can emphasize that this is not the goal: we think it is more interesting to discover important phrases from a large corpus without any manual annotations.