The document describes a method for mining top-k interesting phrases from ad-hoc document collections using sequence pattern indexing. It discusses existing approaches, presents the problem definition, and proposes a new approach that indexes prefix-maximal phrases ordered by position. This indexing structure allows efficient computation of top-k phrases through a merge join process combined with growing phrase patterns, enabled by optimizations like early termination and search space pruning. An evaluation compares the new approach to baseline methods.
A Unifying Four-State Labelling Semantics for Bridging Abstract Argumentation...Carlo Taticchi
In many formalisms extending Dung’s Abstract Argumentation Frameworks (AFs), arguments are not always “present”. In timed AFs, for instance, arguments are only available in precise intervals of time, as they can appear and disappear in an intermittent manner; in incomplete AFs, both attacks and arguments can be absent; in constellation probabilistic AFs (attacks and) arguments have a probability to be present or not, and possible worlds are generated for the computation of the semantics. We review current approaches and propose a four-state labelling semantics to take in account such absent/unknown state of an argument. The four labels we use can be traced to the states a belief can assume, allowing us to also define operations related to belief manipulation, like expansion contraction and revision. We also discuss how labels/states of arguments in an AFs can be modified by using belief revision operations.
A Concurrent Language for Argumentation: Preliminary NotesCarlo Taticchi
While agent-based modelling languages naturally implement concurrency, the currently available languages for argumentation do not allow to explicitly model this type of interaction. In this paper we introduce a concurrent language for handling process arguing and communicating using a shared argumentation framework (reminding shared constraint store as in concurrent constraint). We introduce also basic expansions, contraction and revision procedures as main bricks for enforcement, debate, negotiation and persuasion.
How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations
Text data is potentially valuable for many data science projects but working with text is different from working with structured data. One representation of text that has worked well for many text mining and machine learning applications is the term frequency - inverse document frequency (TF-IDF) vector. In spite of the long winded name, this method is easy to understand, performs well in many applications, and has been implemented in commonly used data science tools. This presentation will introduce TF-IDF and show examples of how to use TF-IDF for document classification and measuring the similarity between documents.
This presentation does not assume any background in text mining or natural language processing. Examples will use Python.
Extending Labelling Semantics to Weighted Argumentation FrameworksCarlo Taticchi
Argumentation Theory provides tools for both modelling and reasoning with controversial information and is a methodology that is often used as a way to give explanations to results provided using machine learning techniques. In this con- text, labelling-based semantics for Abstract Argumentation Frameworks (AFs) allow for establishing the acceptability of sets of arguments, dividing them into three partitions: accept- able, rejected and undecidable (instead of classical Dung two sets IN and OUT partitions). This kind of semantics have been studied only for classical AFs, whilst the more powerful weighted and preference-based framework has been not studied yet. In this paper, we define a novel labelling semantics for Weighted Argumentation Frameworks, extending and generalising the crisp one.
A Unifying Four-State Labelling Semantics for Bridging Abstract Argumentation...Carlo Taticchi
In many formalisms extending Dung’s Abstract Argumentation Frameworks (AFs), arguments are not always “present”. In timed AFs, for instance, arguments are only available in precise intervals of time, as they can appear and disappear in an intermittent manner; in incomplete AFs, both attacks and arguments can be absent; in constellation probabilistic AFs (attacks and) arguments have a probability to be present or not, and possible worlds are generated for the computation of the semantics. We review current approaches and propose a four-state labelling semantics to take in account such absent/unknown state of an argument. The four labels we use can be traced to the states a belief can assume, allowing us to also define operations related to belief manipulation, like expansion contraction and revision. We also discuss how labels/states of arguments in an AFs can be modified by using belief revision operations.
A Concurrent Language for Argumentation: Preliminary NotesCarlo Taticchi
While agent-based modelling languages naturally implement concurrency, the currently available languages for argumentation do not allow to explicitly model this type of interaction. In this paper we introduce a concurrent language for handling process arguing and communicating using a shared argumentation framework (reminding shared constraint store as in concurrent constraint). We introduce also basic expansions, contraction and revision procedures as main bricks for enforcement, debate, negotiation and persuasion.
How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations
Text data is potentially valuable for many data science projects but working with text is different from working with structured data. One representation of text that has worked well for many text mining and machine learning applications is the term frequency - inverse document frequency (TF-IDF) vector. In spite of the long winded name, this method is easy to understand, performs well in many applications, and has been implemented in commonly used data science tools. This presentation will introduce TF-IDF and show examples of how to use TF-IDF for document classification and measuring the similarity between documents.
This presentation does not assume any background in text mining or natural language processing. Examples will use Python.
Extending Labelling Semantics to Weighted Argumentation FrameworksCarlo Taticchi
Argumentation Theory provides tools for both modelling and reasoning with controversial information and is a methodology that is often used as a way to give explanations to results provided using machine learning techniques. In this con- text, labelling-based semantics for Abstract Argumentation Frameworks (AFs) allow for establishing the acceptability of sets of arguments, dividing them into three partitions: accept- able, rejected and undecidable (instead of classical Dung two sets IN and OUT partitions). This kind of semantics have been studied only for classical AFs, whilst the more powerful weighted and preference-based framework has been not studied yet. In this paper, we define a novel labelling semantics for Weighted Argumentation Frameworks, extending and generalising the crisp one.
Introducing a Tool for Concurrent ArgumentationCarlo Taticchi
Agent-based modelling languages naturally implement concurrency for handling complex interactions between communicating agents. On the other hand, the field of Argumentation Theory lacks of instruments to explicitly model concurrent behaviours. In this paper we intro- duce a tool for dealing with concurrent argumentation processes and that can be used, for instance, to model agents debating, negotiating and persuading. The tool implements operations as expansion, contraction and revision. We also provide a web interface exposing the functionalities of the tool and allowing for a more careful study of concurrent processes.
Looking for Invariant Operators in ArgumentationCarlo Taticchi
We study invariant local expansion operators for conflict-free and admissible sets in Abstract Argumentation Frameworks (AFs). Such operators are directly applied on AFs, and are invariant with respect to a chosen “semantics” (that is w.r.t. each of the conflict free/admissible set of arguments). Accordingly, we derive a definition of robustness for AFs in terms of the number of times such operators can be applied without producing any change in the chosen semantics.
A Matrix Based Approach for Weighted Argumentation FrameworksCarlo Taticchi
The assignment of weights to attacks in a classical Argumentation Framework allows to compute semantics by taking into account the different importance of each argument. We represent a Weighted Argumentation Framework by a non-binary matrix, and we characterise the basic extensions (such as w-admissible, w-stable, w-complete) by analysing sub-blocks of this matrix. Also, we show how to reduce the matrix into another one of smaller size, that is equivalent to the original one for the determination of extensions. Furthermore, we provide two algorithms that allow to build incrementally w-grounded and w-preferred extensions starting from a w-admissible extension.
Looking for Invariant Operators in ArgumentationCarlo Taticchi
We study invariant local expansion operators for conflict-free and admissible sets in Abstract Argumentation Frameworks (AFs). Such operators are directly applied on AFs, and are invariant with respect to a chosen “semantics” (that is w.r.t. each of the conflict free/admissible set of arguments). Accordingly, we derive a definition of robustness for AFs in terms of the number of times such operators can be applied without producing any change in the chosen semantics.
While agent-based modelling languages naturally implement concurrency, the currently available languages for argumentation do not allow to explicitly model this type of interaction. In this paper we introduce a concurrent language for handling process arguing and communicating using a shared argumentation framework (reminding shared constraint store as in concurrent constraint). We introduce also basic expansions, contraction and revision procedures as main bricks for en- forcement, debate, negotiation and persuasion.
ITU - MDD - Textural Languages and GrammarsTonny Madsen
This presentation describes the use and design of textural domain specific language - DSL. It has two basic purposes:
Introduce you to some of the more important design criteria in language design
Introduce you to BNF
This presentation is developed for MDD 2010 course at ITU, Denmark.
Data produced by the Ozone PEATE from the Ozone Mapping and Profiler Suite (OMPS) instruments are to be stored in HDF5, not HDF-EOS, but will still need some features similar to those in HDF-EOS. In particular, a mechanism for handling dimension names will be needed. This poster proposes a method to handle dimension names for arrays in HDF5 in a manner commensurate with HDF-EOS5.
FScaFi: A Core Calculus for Collective Adaptive Systems ProgrammingRoberto Casadei
A recently proposed approach to the rigorous engineering of collective adaptive systems is the aggregate computing paradigm, which operationalises the idea of expressing collective adaptive behaviour by a global perspective as a functional composition of dynamic computational fields (i.e., structures mapping a collection of individual devices of a collective to computational values over time). In this paper, we present FScaFi, a core language that captures the essence of exploiting field computations in mainstream functional languages, and which is based on a semantic model for field computations leveraging the novel notion of “computation against a neighbour”. Such a construct models expressions whose evaluation depends on the same evaluation that occurred on a neighbour, thus abstracting communication actions and, crucially, enabling deep and straightforward integration in the Scala programming language, by the ScaFi incarnation. We cover syntax and informal semantics of FScaFi, provide examples of collective adaptive behaviour development in ScaFi, and delineate future work.
A Concurrent Argumentation Language for Negotiation and DebatingCarlo Taticchi
My PhD thesis proposal is to define a concurrent language for negotiation and debating through the use of argumentation. In our system, communication between intelligent agents (for instance, for content negotiation in a web server, or automatic debating systems) will be modelled as Argumentation Frameworks. High-level primitives will be provided in order to implement interactions, taking into account belief revision notions from AGM and KM theory. Furthermore, we plan to introduce in our language concepts as ‘ownership” for arguments and semantics preserving operations for better modelling real case applications.
Semidefinite programming, binary codes and a graph coloring problemChao Li
This slide gives a summary of my thesis project and gives some background in terms of semidefinite programming, code theory, and association schemes etc.
Introducing a Tool for Concurrent ArgumentationCarlo Taticchi
Agent-based modelling languages naturally implement concurrency for handling complex interactions between communicating agents. On the other hand, the field of Argumentation Theory lacks of instruments to explicitly model concurrent behaviours. In this paper we intro- duce a tool for dealing with concurrent argumentation processes and that can be used, for instance, to model agents debating, negotiating and persuading. The tool implements operations as expansion, contraction and revision. We also provide a web interface exposing the functionalities of the tool and allowing for a more careful study of concurrent processes.
Looking for Invariant Operators in ArgumentationCarlo Taticchi
We study invariant local expansion operators for conflict-free and admissible sets in Abstract Argumentation Frameworks (AFs). Such operators are directly applied on AFs, and are invariant with respect to a chosen “semantics” (that is w.r.t. each of the conflict free/admissible set of arguments). Accordingly, we derive a definition of robustness for AFs in terms of the number of times such operators can be applied without producing any change in the chosen semantics.
A Matrix Based Approach for Weighted Argumentation FrameworksCarlo Taticchi
The assignment of weights to attacks in a classical Argumentation Framework allows to compute semantics by taking into account the different importance of each argument. We represent a Weighted Argumentation Framework by a non-binary matrix, and we characterise the basic extensions (such as w-admissible, w-stable, w-complete) by analysing sub-blocks of this matrix. Also, we show how to reduce the matrix into another one of smaller size, that is equivalent to the original one for the determination of extensions. Furthermore, we provide two algorithms that allow to build incrementally w-grounded and w-preferred extensions starting from a w-admissible extension.
Looking for Invariant Operators in ArgumentationCarlo Taticchi
We study invariant local expansion operators for conflict-free and admissible sets in Abstract Argumentation Frameworks (AFs). Such operators are directly applied on AFs, and are invariant with respect to a chosen “semantics” (that is w.r.t. each of the conflict free/admissible set of arguments). Accordingly, we derive a definition of robustness for AFs in terms of the number of times such operators can be applied without producing any change in the chosen semantics.
While agent-based modelling languages naturally implement concurrency, the currently available languages for argumentation do not allow to explicitly model this type of interaction. In this paper we introduce a concurrent language for handling process arguing and communicating using a shared argumentation framework (reminding shared constraint store as in concurrent constraint). We introduce also basic expansions, contraction and revision procedures as main bricks for en- forcement, debate, negotiation and persuasion.
ITU - MDD - Textural Languages and GrammarsTonny Madsen
This presentation describes the use and design of textural domain specific language - DSL. It has two basic purposes:
Introduce you to some of the more important design criteria in language design
Introduce you to BNF
This presentation is developed for MDD 2010 course at ITU, Denmark.
Data produced by the Ozone PEATE from the Ozone Mapping and Profiler Suite (OMPS) instruments are to be stored in HDF5, not HDF-EOS, but will still need some features similar to those in HDF-EOS. In particular, a mechanism for handling dimension names will be needed. This poster proposes a method to handle dimension names for arrays in HDF5 in a manner commensurate with HDF-EOS5.
FScaFi: A Core Calculus for Collective Adaptive Systems ProgrammingRoberto Casadei
A recently proposed approach to the rigorous engineering of collective adaptive systems is the aggregate computing paradigm, which operationalises the idea of expressing collective adaptive behaviour by a global perspective as a functional composition of dynamic computational fields (i.e., structures mapping a collection of individual devices of a collective to computational values over time). In this paper, we present FScaFi, a core language that captures the essence of exploiting field computations in mainstream functional languages, and which is based on a semantic model for field computations leveraging the novel notion of “computation against a neighbour”. Such a construct models expressions whose evaluation depends on the same evaluation that occurred on a neighbour, thus abstracting communication actions and, crucially, enabling deep and straightforward integration in the Scala programming language, by the ScaFi incarnation. We cover syntax and informal semantics of FScaFi, provide examples of collective adaptive behaviour development in ScaFi, and delineate future work.
A Concurrent Argumentation Language for Negotiation and DebatingCarlo Taticchi
My PhD thesis proposal is to define a concurrent language for negotiation and debating through the use of argumentation. In our system, communication between intelligent agents (for instance, for content negotiation in a web server, or automatic debating systems) will be modelled as Argumentation Frameworks. High-level primitives will be provided in order to implement interactions, taking into account belief revision notions from AGM and KM theory. Furthermore, we plan to introduce in our language concepts as ‘ownership” for arguments and semantics preserving operations for better modelling real case applications.
Semidefinite programming, binary codes and a graph coloring problemChao Li
This slide gives a summary of my thesis project and gives some background in terms of semidefinite programming, code theory, and association schemes etc.
... or how to query an RDF graph with 28 billion triples in a standard laptop
These slides correspond to my talk at the Stanford Center for Biomedical Informatics, on 25th April 2018
Выступление Сергея Кольцова (НИУ ВШЭ) на International Conference on Big Data and its Applications (ICBDA).
ICBDA — конференция для предпринимателей и разработчиков о том, как эффективно решать бизнес-задачи с помощью анализа больших данных.
http://icbda2015.org/
Survey of Generative Clustering Models 2008Roman Stanchak
Survey of Generative Clustering Models "Probabilistic Topic Models" circa 2008. Class presentation by Roman Stanchak and Prithviraj Sen for University of Maryland College Park cmsc828g, Link Mining and Dynamic Graph Analysis. Spring 2008. Instructor: Prof. Lise Getoor
Mathematical Modeling for Practical ProblemsLiwei Ren任力偉
Mathematical modeling is an important step for developing many advanced technologies in various domains such as network security, data mining and etc… This lecture introduces a process that the speaker summarizes from his past practice of mathematical modeling and algorithmic solutions in IT industry, as an applied mathematician, algorithm specialist or software engineer , and even as an entrepreneur. A practical problem from DLP system will be used as an example for creating math models and providing algorithmic solutions.
KDD 2014 Presentation (Best Research Paper Award): Alias Topic Modelling (Red...Aaron Li
Video (2014): http://videolectures.net/kdd2014_li_sampling_complexity/
This paper presents an approximate sampler for topic models that theoretically and experimentally outperforms existing samplers thereby allowing topic models to scale to industry-scale datasets.
A TEXT MINING RESEARCH BASED ON LDA TOPIC MODELLINGcscpconf
A Large number of digital text information is generated every day. Effectively searching,
managing and exploring the text data has become a main task. In this paper, we first represent
an introduction to text mining and a probabilistic topic model Latent Dirichlet allocation. Then
two experiments are proposed - Wikipedia articles and users’ tweets topic modelling. The
former one builds up a document topic model, aiming to a topic perspective solution on
searching, exploring and recommending articles. The latter one sets up a user topic model,
providing a full research and analysis over Twitter users’ interest. The experiment process
including data collecting, data pre-processing and model training is fully documented and
commented. Further more, the conclusion and application of this paper could be a useful
computation tool for social and business research.
Similar to EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing (20)
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequence pattern indexing
1. Top-k Interesting Phrase Mining in Ad-hoc Collections
using Sequence Pattern Indexing
Chuancong Gao, Sebastian Michel
Max-Planck Institute for Informatics & Saarland University
Saarbr¨ucken, Germany
EDBT, March 28, 2012
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 1 / 31
2. Motivation
Lost overview on current presidential campaign topics?
Get an update on key phrases of Mitt Romney!
• Want concise summary.
• Solution: Mining top-k phrases w.r.t. subset of all documents.
• Sub-set of documents is determined in ad-hoc fashion.
• E.g., using keyword queries, restriction to categorical attributes.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 2 / 31
3. Motivation (Example)
1988 George Bush Kinder, Gentler Nation
1992 Bill Clinton Dont stop thinking about tomorrow
1992 Bill Clinton Putting People First
1992 Ross Perot Ross for Boss
1996 Bill Clinton Building a bridge to the 21st century
1996 Bob Dole The Better Man for a Better America
2000 Al Gore Prosperity and progress
2000 George W. Bush Compassionate conservatism
2000 George W. Bush Leave no child behind
2000 Ralph Nader Government of, by, and for the people...
2004 John Kerry Let America be America Again
2004 George W. Bush Yes, America Can!
2008 John McCain Country First
2008 Barack Obama Change We Can Believe In
2008 Barack Obama Yes We Can!
http://www.presidentsusa.net/campaignslogans.html
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 3 / 31
4. Problem Statement
Given: a document corpus
Task: Create an Index
• that organizes documents and phrases
• in a compact way
• to query for top-k phrases of arbitrary ad-hoc collections.
• Devise appropriate algorithms to use the index efficiently.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 4 / 31
5. Problem Definition
Given a document corpus D and a subset D .
Find the top-k frequent phrases with the highest interestingness in D .
Ad-hoc Document Subset
D is not known upfront
Phrase
A phrase p is contained in document d iff p is a continuous sub-sequence
of d (notation p d).
• p’s length is restricted by min-len and max-len.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 5 / 31
6. Problem Definition (Cont.)
Frequency
The frequency of a phrase p in D is called the support value of p w.r.t.
D , denoted by supD
p .
• supD
p = |{d|d ∈ D ∧ p d}|
• A phrase is called frequent in a document collection D iff p’s support
is larger than the user-specified threshold min-supD .
Interestingness
The interestingness measure we adopted is the confidence value (or
relevance value), defined as
conf D
p =
supD
p
supD
p
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 6 / 31
7. Example
Find the top-k (k = 3) interesting phrases for 4 documents.
D
D
ID Document
d1 eabefcad
d2 dcbeafe
d3 fceacd
d4 acfdbeacd
(a) Documents
k
ID Phrase Conf.
p1 ac 2/2
p2 bea 2/2
p3 ea 3/4
p4 be 2/3
· · · · · ·
(b) Top-k Phrases
(min-supD = 2, k = 3, min-len = 2, max-len = 4)
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 7 / 31
9. Existing Approaches - Phrase Inverted Indexing
(A. Simitsis et al. - Multidimensional content eXploration. VLDB’08)
Indexing Steps
For each phrase p, store an inverted list that contains the IDs of all
documents containing p.
Example index:
ID Document IDs
p1 {d1, d2}
p2 {d1, d3}
p3 {d1, d3, d4}
p4 {d1, d2, d3, d4}
Given an ad-hoc collection D : Alg. requires a full scan of all the inverted
lists.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 9 / 31
10. Existing Approaches - Forward Indexing
(S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics.
VLDB’10)
Indexing Steps
Builds a forward list for each document d ∈ D, containing all the IDs of
phrases contained in d.
Example index structure:
ID Forward List
d1 {p1:2, p2:2, p3:3, p4:4}
d2 {p1:2, p4:4}
d3 {p2:2, p3:3, p4:4}
d4 {p3:3, p4:4}
Numbers after “:” denote global support values.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 10 / 31
11. Existing Approaches - Forward Indexing (Cont.)
Query Processing
Compute top-k phrases with |D |-way merge join.
• Global support value of p stored explicitly in list.
• Local support value + confidence is computed during the join process.
• Nice: only the forward lists of documents in D need be to scanned.
• Even better: can apply early termination:
• phrase IDs in each forward list stored in ascending order of global
support values
• then, conf D
p has an upper bound min 1,
|D |
supD
p
.
• Can stop if the top-k results we found can not be topped anymore.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 11 / 31
12. Prefix-Maximal Indexing
(S. Bedathur et al. - Interesting-Phrase Mining for Ad-Hoc Text Analytics.
VLDB’10)
Key Idea
Store only prefix-maximal phrases (small subset of all phrases) w.r.t. each
document in the corpus.
• A phrase p is called maximal w.r.t. a document d ∈ D iff ∃p that p
is frequent and p p .
• A phrase p is called prefix-maximal w.r.t. a document d ∈ D iff p is
maximal and p that p is maximal and p is a prefix of p .
Significantly reduces the index size comparing to Forward Indexing.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 12 / 31
14. Existing Approaches - Phrase-Maximal Indexing (Cont.)
The top-k phrases are computed by |D |-way merge join.
• Support values computed when merging all the forward lists in D .
• The confidence values are calculated by using a separate global
support value dictionary.
• Issues:
• A full scan of the forward lists of D is required, since no early
termination can be applied.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 14 / 31
15. Objectives
Devise an approach that
1. is as fast as Forward Indexing (i.e., we must devise
early stopping)
2. and has an index as small (or smaller than)
Phrase-Maximal Indexing
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 15 / 31
17. Our Approach - Key Ideas
We also store prefix-maximal phrases for each document d ∈ D, but ...
Lexicographical order vs. matching position order:
ID Prefix-Maximal Phrases
d1 {ad, befc, cad, ea, efca, fcad}
d2 {bea, cbea, dcbe, ea, fe}
d3 {acd, cd, ceac, eacd, fcea}
d4 {acd, acfd, bea, cfd, eac, fd}
· · · · · ·
ID Prefix-Maximal Phrases (Ordered by Position)
d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7}
d2 {dcbe@1, cbea@2, bea@3, ea@4, fe@6}
d3 {fcea@1, ceac@2, eacd@3, acd@4, cd@5}
d4 {acfd@1, cfd@2, fd@3, bea@5, eacd@6, acd@7}
· · · · · ·
– Duplicate Prefix Sub-Phrase Comparing to Previous Phrase – Duplicate Prefixes among Phrases
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 17 / 31
18. Our Approach - Indexing
ID Prefix-Maximal Phrases (Ordered by Position)
d1 {ea@1, befc@3, efca@4, fcad@5, cad@6, ad@7}
a
1 2
e
1
b
1
e
2
f
1
3
c
2
4
f2
c3
a4
a3
d d4
1 2
3
2
n1 n2 n3 n4 n5 n6
n7 n8 n9
n10 n11n12
Possible Lengths (1-4)
• Sub-phrases are organized in a forest structure; each node contains
one item + length information.
• Links from outside point to start items of indexed prefix-maximal
phrases.
• Possible length information is stored in bit array.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 18 / 31
19. Memory Layout
a
1 2
e
1
b
1
e
2
f
1
3 c
2
4
f2
c3
a4
a3
d d4
1 2
3
2
n1 n2 n3 n4 n5 n6
n7 n8 n9
n10 n11n12
Possible Lengths (1-4)
ae d f c a e f cb
00
0010
0000
0011
0000
0001
000014
0101
0000
0010
1000
0011
0100
0001
0000
0000
1000
0100
0100
0010
0000
00 04 06 10 14 18 22 26 30 34 38
00
06 10 26 34
8 bits24 bits1+15 bits
16 bits
a d0001
1000
0100
1100
42 46
38
50
0100
1100
Word / Link
End of Branch
Possible Lengths (1-6)
(Number below cells indicate a relative memory address)
Each node is assigned with 4 bytes (32 bits); the first byte storing flags, the last 3 bytes store
the item.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 19 / 31
20. Our Approach - Indexing (Cont.) - Improving
a1 2
e1
b1
e2
f1
3
c2
4
f
2
c3 a4
a3
d d4
1 2
3
2
n1 n2 n3 n4 n5 n6
n7 n8 n9
n10 n11n12
Possible Lengths (1-4)
• Still some duplications like e → f → c → a and a → d.
• Further improvement: utilize specific matching positions (instead of
just matching orders).
• Idea: Use graph structure to eliminate more duplicates.
a
1 2
e1
b1
e2
f1
3
c
2
4 a3 d4
1 2
3
n1 n2 n3 n4 n5 n6 n7 n8
2
3 4
2
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 20 / 31
21. Our Approach - Top-k Phrase Computing
Computation is done in two steps (interleaved): |D |-way merge join and
growing patterns:
Merge join approach is used on phrase prefixes with length 1
• Index contains an address list for different items in document d.
• Apply |D |-way merge join on these address lists to retrieve all
length-1 phrases.
Extend each length-1 phrase
In each step: the prefix is extended with one of the local frequent items.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 21 / 31
22. Optimizations
Early Stopping
• Can tell when it is “OK” to stop retrieving more length 1 patterns
Search Space Pruning
• During expansion of a length-1 pattern to larger patterns, stop when
unpromising.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 22 / 31
23. Our Approach - Summary
Our Method’s Advantages:
• Index with little redundancy
• Algorithms fully utilize this new index structure.
• Visiting only frequent phrases:
• With early termination in |D |-merge join process.
• And search space pruning in pattern-growth process.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 23 / 31
25. Evaluation
Different versions of own approach, plus two competitors.
Using an Intel Xeon W3520 (2.66 GHz) CPU and 24 GB memory
workstation.
Implementation in C#. Index in memory.
Algorithms under Comparison
Algorithm Description
Baselines
ForwardIndex Forward Indexing
PreMaxIndex Prefix-Maximal Indexing
Our Approaches
SeqPattIndex Our Indexing
SeqPattIndex IM Our Improved Indexing
SeqPattIndex NP SeqPattIndex With Early Termination and
Search Space Pruning Turned off
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 25 / 31
26. Datasets
Global Per Document
Dataset #Doc. #Word Max. Len. Avg. Len. Avg. #Word
PubMed Titles 17,826,927 2,028,673 167 11.59 10.99
PubMed Abstracts 2,500,000 2,075,526 1599 149.72 92.1
Extracted from PubMed – the largest free publication database on life
sciences and biomedical topics.
• Titles are rather short, little redundancy per title.
• Abstracts contain much more information, more redundancy
document.
Great to study pros and cons of the discussed approaches.
Queries: Generated
• to simulate different kind of queries (by size of D )
• grouped phrases by global support according to several ranges
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 26 / 31
27. Evaluation - In-Memory Index Size (in MB)
SeqPattIndex
SeqPattIndex_IM
PreMaxIndex
ForwardIndex
1000
2000
3000
4000
5000
5 10 15 20 25
IndexSize(inMB)
Minimum Support
(a) PubMed Titles
1000
2000
3000
4000
5000
6000
7000
5 10 15 20 25
IndexSize(inMB)
Minimum Support
(b) PubMed Abstracts
• PubMed Titles: improved approach close to original one, as phrases in titles have little
redundancy
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 27 / 31
28. Evaluation - Average Querying Time (in sec)
PubMed Titles
SeqPattIndex SeqPattIndex_NP PreMaxIndex ForwardIndex
0.001
0.01
0.1
1
10
100 1000 10000 100000
Avg.QueryingTime(insec)
|D’|
(a) Varying |D | (k = 10, min-sup = 5)
0.1
1
10
100
10 100 1000 10000
Avg.QueryingTime(insec)
k
(b) Varying k (min-sup = 5)
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 28 / 31
29. Evaluation - Average Querying Time (in sec)
PubMed Abstract
SeqPattIndex SeqPattIndex_NP PreMaxIndex ForwardIndex
0.01
0.1
1
10
100
100 1000 10000 100000
Avg.QueryingTime(insec)
|D’|
(a) Varying |D | (k = 10, min-sup = 5)
0.1
1
10
100
10000
10 100 1000 10000
Avg.QueryingTime(insec) k
(b) Varying k (min-sup = 5)
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 29 / 31
31. Conclusion
• Presented approach to ad-hoc frequent phrases mining
• Making use of redundancy of sequentially aligned
phrases
• Index is very compact.
• High performance in query processing based on early
stopping and search space pruning.
C. Gao & S. Michel, EDBT 2012, Berlin Interesting Phrase Mining 31 / 31