The document discusses sequential pattern mining algorithms. It begins by introducing sequential patterns and challenges in mining them from transaction databases. It then describes the Apriori-based GSP algorithm, which generates candidate sequences level-by-level and scans the database multiple times. The document also introduces pattern-growth methods like PrefixSpan that avoid candidate generation by projecting databases based on prefixes. Finally, it discusses optimizations like pseudo-projection that speed up sequential pattern mining.
Introduction To Multilevel Association Rule And Its MethodsIJSRD
Association rule mining is a popular and well researched method for discovering interesting relations between variables in large databases. In this paper we introduce the concept of Data mining, Association rule and Multilevel association rule with different algorithm, its advantage and concept of Fuzzy logic and Genetic Algorithm. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Machine Learning and Data Mining: 04 Association Rule MiningPier Luca Lanzi
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. This lecture introduces association rule mining and the Apriori algorithm
Slide explaining the distinction between bagging and boosting while understanding the bias variance trade-off. Followed by some lesser known scope of supervised learning. understanding the effect of tree split metric in deciding feature importance. Then understanding the effect of threshold on classification accuracy. Additionally, how to adjust model threshold for classification in supervised learning.
Note: Limitation of Accuracy metric (baseline accuracy), alternative metrics, their use case and their advantage and limitations were briefly discussed.
This slide first introduces the sequential pattern mining problem and also presents some required definitions in order to understand GSP algorithm. At then end there is a brief introduction of GSP algorithm and some practical constraints which it supports.
In this tutorial, we will learn the the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Introduction To Multilevel Association Rule And Its MethodsIJSRD
Association rule mining is a popular and well researched method for discovering interesting relations between variables in large databases. In this paper we introduce the concept of Data mining, Association rule and Multilevel association rule with different algorithm, its advantage and concept of Fuzzy logic and Genetic Algorithm. Multilevel association rules can be mined efficiently using concept hierarchies under a support-confidence framework.
Machine Learning and Data Mining: 04 Association Rule MiningPier Luca Lanzi
Course "Machine Learning and Data Mining" for the degree of Computer Engineering at the Politecnico di Milano. This lecture introduces association rule mining and the Apriori algorithm
Slide explaining the distinction between bagging and boosting while understanding the bias variance trade-off. Followed by some lesser known scope of supervised learning. understanding the effect of tree split metric in deciding feature importance. Then understanding the effect of threshold on classification accuracy. Additionally, how to adjust model threshold for classification in supervised learning.
Note: Limitation of Accuracy metric (baseline accuracy), alternative metrics, their use case and their advantage and limitations were briefly discussed.
This slide first introduces the sequential pattern mining problem and also presents some required definitions in order to understand GSP algorithm. At then end there is a brief introduction of GSP algorithm and some practical constraints which it supports.
In this tutorial, we will learn the the following topics -
+ Linear SVM Classification
+ Soft Margin Classification
+ Nonlinear SVM Classification
+ Polynomial Kernel
+ Adding Similarity Features
+ Gaussian RBF Kernel
+ Computational Complexity
+ SVM Regression
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Subrata Kumer Paul
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
Leveraging data to become more customer-centric is a key factor for online retail sales. Using a host of Machine learning techniques like recommender systems, image analytics, customer churn and demand prediction- can impact sales, customer loyalty & improve revenues
Two hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
Data Mining is newly technology and it's very useful for Data analytics for business analysis purpose and decision making data. This PPT described Data Mining in very easy way.
Chapter 6. Mining Frequent Patterns, Associations and Correlations Basic Conc...Subrata Kumer Paul
Jiawei Han, Micheline Kamber and Jian Pei
Data Mining: Concepts and Techniques, 3rd ed.
The Morgan Kaufmann Series in Data Management Systems
Morgan Kaufmann Publishers, July 2011. ISBN 978-0123814791
Leveraging data to become more customer-centric is a key factor for online retail sales. Using a host of Machine learning techniques like recommender systems, image analytics, customer churn and demand prediction- can impact sales, customer loyalty & improve revenues
Two hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
Data Mining is newly technology and it's very useful for Data analytics for business analysis purpose and decision making data. This PPT described Data Mining in very easy way.
International Journal of Computational Engineering Research (IJCER) is dedicated to protecting personal information and will make every reasonable effort to handle collected information appropriately. All information collected, as well as related requests, will be handled as carefully and efficiently as possible in accordance with IJCER standards for integrity and objectivity.
Graph based Approach and Clustering of Patterns (GACP) for Sequential Pattern...AshishDPatel1
The sequential pattern mining generates the sequential patterns. It can be used as the input of another program for retrieving the information from the large collection of data. It requires a large amount of memory as well as numerous I/O operations. Multistage operations reduce the efficiency of the
algorithm. The given GACP is based on graph representation and avoids recursively reconstructing intermediate trees during the mining process. The algorithm also eliminates the need of repeatedly scanning the database. A graph used in GACP is a data structure accessed starting at its first node called root and each node of a graph is either a leaf or an interior node. An interior node has one or more child nodes, thus from the root to any node in the graph defines a sequence. After construction of the graph the pruning technique called clustering is used to retrieve the records from the graph. The algorithm can be used to mine the database using compact memory based data structures and cleaver pruning methods.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
3. Chapter 8. Mining Stream, Time-
Series, and Sequence Data
Mining data streams
Mining time-series data
Mining sequence patterns in
transactional databases
Mining sequence patterns in biological
data
April 18, 2013 Data Mining: Concepts and Techniques
3
4. Sequence Databases & Sequential
Patterns
Transaction databases, time-series databases vs. sequence
databases
Frequent patterns vs. (frequent) sequential patterns
Applications of sequential pattern mining
Customer shopping sequences:
First buy computer, then CD-ROM, and then digital
camera, within 3 months.
Medical treatments, natural disasters (e.g., earthquakes),
science & eng. processes, stocks and markets, etc.
Telephone calling patterns, Weblog click streams
DNA sequences and gene structures
April 18, 2013 Data Mining: Concepts and Techniques
4
5. What Is Sequential Pattern
Mining?
Given a set of sequences, find the complete set of
frequent subsequences
A sequence : < (ef) (ab) (df) c b >
A sequence database
SID sequence An element may contain a set of items.
10 <a(abc)(ac)d(cf)> Items within an element are unordered
and we list them alphabetically.
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb> <a(bc)dc> is a subsequence
40 <eg(af)cbc> of <a(abc)(ac)d(cf)>
Given support threshold min_sup =2, <(ab)c> is a
sequential pattern
April 18, 2013 Data Mining: Concepts and Techniques
5
6. Challenges on Sequential Pattern Mining
A huge number of possible sequential patterns are
hidden in databases
A mining algorithm should
find the complete set of patterns, when possible,
satisfying the minimum support (frequency) threshold
be highly efficient, scalable, involving only a small
number of database scans
be able to incorporate various kinds of user-specific
constraints
April 18, 2013 Data Mining: Concepts and Techniques
6
7. Sequential Pattern Mining Algorithms
Concept introduction and an initial Apriori-like algorithm
Agrawal & Srikant. Mining sequential patterns, ICDE’95
Apriori-based method: GSP (Generalized Sequential Patterns:
Srikant & Agrawal @ EDBT’96)
Pattern-growth methods: FreeSpan & PrefixSpan (Han et
al.@KDD’00; Pei, et al.@ICDE’01)
Vertical format-based mining: SPADE (Zaki@Machine Leanining’00)
Constraint-based sequential pattern mining (SPIRIT: Garofalakis,
Rastogi, Shim@VLDB’99; Pei, Han, Wang @ CIKM’02)
Mining closed sequential patterns: CloSpan (Yan, Han & Afshar
@SDM’03)
April 18, 2013 Data Mining: Concepts and Techniques
7
8. The Apriori Property of Sequential
Patterns
A basic property: Apriori (Agrawal & Sirkant’94)
If a sequence S is not frequent
Then none of the super-sequences of S is frequent
E.g, <hb> is infrequent so do <hab> and <(ah)b>
Seq. ID Sequence
Given support threshold
10 <(bd)cb(ac)>
min_sup =2
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
April 18, 2013 Data Mining: Concepts and Techniques
8
9. GSP—Generalized Sequential Pattern Mining
GSP (Generalized Sequential Pattern) mining algorithm
proposed by Agrawal and Srikant, EDBT’96
Outline of the method
Initially, every item in DB is a candidate of length-1
for each level (i.e., sequences of length-k) do
scan database to collect support count for each
candidate sequence
generate candidate length-(k+1) sequences from
length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate can
be found
Major strength: Candidate pruning by Apriori
April 18, 2013 Data Mining: Concepts and Techniques
9
10. Finding Length-1 Sequential Patterns
Examine GSP using an example
Initial candidates: all singleton sequences
<a>, <b>, <c>, <d>, <e>, <f>, <g>, <h>
Cand Sup
<a> 3
Scan database once, count support for
candidates <b> 5
<c> 4
min_sup =2 <d> 3
Seq. ID Sequence
10 <(bd)cb(ac)>
<e> 3
20 <(bf)(ce)b(fg)> <f> 2
30 <(ah)(bf)abf> <g> 1
40 <(be)(ce)d>
<h> 1
50 <a(bd)bcb(ade)>
April 18, 2013 Data Mining: Concepts and Techniques
10
12. The GSP Mining Process
5th scan: 1 cand. 1 length-5 seq. <(bd)cba> Cand. cannot
pat. pass sup.
threshold
4th scan: 8 cand. 6 length-4 seq. <abba> <(bd)bc> … Cand. not in DB at all
pat.
3rd scan: 47 cand. 19 length-3 seq. <abb> <aab> <aba> <baa> <bab> …
pat. 20 cand. not in DB at all
2nd scan: 51 cand. 19 length-2 seq.
pat. 10 cand. not in DB at all <aa> <ab> … <af> <ba> <bb> … <ff> <(ab)> … <(ef)>
1st scan: 8 cand. 6 length-1 seq.
pat. <a> <b> <c> <d> <e> <f> <g> <h>
Seq. ID Sequence
10 <(bd)cb(ac)>
min_sup =2
20 <(bf)(ce)b(fg)>
30 <(ah)(bf)abf>
40 <(be)(ce)d>
50 <a(bd)bcb(ade)>
April 18, 2013 Data Mining: Concepts and Techniques
12
13. Candidate Generate-and-test:
Drawbacks
A huge set of candidate sequences generated.
Especially 2-item candidate sequence.
Multiple Scans of database needed.
The length of each candidate grows by one at each
database scan.
Inefficient for mining long sequential patterns.
A long pattern grow up from short patterns
The number of short patterns is exponential to the
length of mined patterns.
April 18, 2013 Data Mining: Concepts and Techniques
13
14. The SPADE Algorithm
SPADE (Sequential PAttern Discovery using Equivalent
Class) developed by Zaki 2001
A vertical format sequential pattern mining method
A sequence database is mapped to a large set of
Item: <SID, EID>
Sequential pattern mining is performed by
growing the subsequences (patterns) one item at a time
by Apriori candidate generation
April 18, 2013 Data Mining: Concepts and Techniques
14
16. Bottlenecks of GSP and SPADE
A huge set of candidates could be generated
1,000 frequent length-1 sequences generate s huge number of
1000 × 999
length-2 candidates! 1000 ×1000 + = 1,499,500
2
Multiple scans of database in mining
Breadth-first search
Mining long sequential patterns
Needs an exponential number of short candidates
100
100 100
A length-100 sequential pattern needs 10
∑ i
30
= 2 − 1 ≈ 1030
i =1
candidate sequences!
April 18, 2013 Data Mining: Concepts and Techniques
16
17. Prefix and Suffix (Projection)
<a>, <aa>, <a(ab)> and <a(abc)> are prefixes of
sequence <a(abc)(ac)d(cf)>
Given sequence <a(abc)(ac)d(cf)>
Prefix Suffix (Prefix-Based Projection)
<a> <(abc)(ac)d(cf)>
<aa> <(_bc)(ac)d(cf)>
<ab> <(_c)(ac)d(cf)>
April 18, 2013 Data Mining: Concepts and Techniques
17
18. Mining Sequential Patterns by Prefix
Projections
Step 1: find length-1 sequential patterns
<a>, <b>, <c>, <d>, <e>, <f>
Step 2: divide search space. The complete set of seq. pat.
can be partitioned into 6 subsets:
The ones having prefix <a>;
The ones having prefix <b>;
SID sequence
…
10 <a(abc)(ac)d(cf)>
The ones having prefix <f>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
April 18, 2013 Data Mining: Concepts and Techniques
18
19. Finding Seq. Patterns with Prefix
<a>
Only need to consider projections w.r.t. <a>
<a>-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)
(ae)>, <(_b)(df)cb>, <(_f)cbc>
Find all the length-2 seq. pat. Having prefix <a>: <aa>,
<ab>, <(ab)>, <ac>, <ad>, <af>
Further partition into 6 subsets SID sequence
Having prefix <aa>; 10 <a(abc)(ac)d(cf)>
… 20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
Having prefix <af>
40 <eg(af)cbc>
April 18, 2013 Data Mining: Concepts and Techniques
19
20. Completeness of PrefixSpan
SDB
SID sequence
Length-1 sequential patterns
10 <a(abc)(ac)d(cf)>
<a>, <b>, <c>, <d>, <e>, <f>
20 <(ad)c(bc)(ae)>
30 <(ef)(ab)(df)cb>
40 <eg(af)cbc>
Having prefix <a> Having prefix <c>, …, <f>
Having prefix <b>
<a>-projected database
<(abc)(ac)d(cf)> Length-2 sequential
<b>-projected database …
<(_d)c(bc)(ae)> patterns
<(_b)(df)cb> <aa>, <ab>, <(ab)>,
<(_f)cbc> <ac>, <ad>, <af>
……
Having prefix <aa> Having prefix <af>
<aa>-proj. db … <af>-proj. db
April 18, 2013 Data Mining: Concepts and Techniques
20
21. Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected
databases
Can be improved by pseudo-projections
April 18, 2013 Data Mining: Concepts and Techniques
21
22. Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Postfixes of sequences often appear
repeatedly in recursive projected databases
When (projected) database can be held in main
memory, use pointers to form projections
Pointer to the sequence s=<a(abc)(ac)d(cf)>
Offset of the postfix <a>
s|<a>: ( , 2) <(abc)(ac)d(cf)>
<ab>
s|<ab>: ( , 4) <(_c)(ac)d(cf)>
April 18, 2013 Data Mining: Concepts and Techniques
22
23. Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying postfixes
Efficient in running time and space when database
can be held in main memory
However, it is not efficient when database cannot fit in
main memory
Disk-based random accessing is very costly
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the data set
fits in memory
April 18, 2013 Data Mining: Concepts and Techniques
23
24. Performance on Data Set
C10T8S8I8
April 18, 2013 Data Mining: Concepts and Techniques
24
25. Performance on Data Set Gazelle
April 18, 2013 Data Mining: Concepts and Techniques
25
27. CloSpan: Mining Closed Sequential
Patterns
A closed sequential pattern s:
there exists no superpattern
s’ such that s’ כs, and s’ and
s have the same support
Motivation: reduces the
number of (redundant)
patterns but attains the same
expressive power
Using Backward Subpattern
and Backward Superpattern
pruning to prune redundant
search space
April 18, 2013 Data Mining: Concepts and Techniques
27
29. Constraint-Based Seq.-Pattern Mining
Constraint-based sequential pattern mining
Constraints: User-specified, for focused mining of desired
patterns
How to explore efficient mining with constraints? —
Optimization
Classification of constraints
Anti-monotone: E.g., value_sum(S) < 150, min(S) > 10
Monotone: E.g., count (S) > 5, S ⊇ {PC, digital_camera}
Succinct: E.g., length(S) ≥ 10, S ∈ {Pentium, MS/Office,
MS/Money}
Convertible: E.g., value_avg(S) < 25, profit_sum (S) >
160, max(S)/avg(S) < 2, median(S) – min(S) > 5
Inconvertible: E.g., avg(S) – median(S) = 0
April 18, 2013 Data Mining: Concepts and Techniques
29
30. From Sequential Patterns to Structured Patterns
Sets, sequences, trees, graphs, and other structures
Transaction DB: Sets of items
{{i , i , …, i }, …}
1 2 m
Seq. DB: Sequences of sets:
{<{i , i }, …, {i , i , i }>, …}
1 2 m n k
Sets of Sequences:
{{<i , i >, …, <i , i , i >}, …}
1 2 m n k
Sets of trees: {t1, t2, …, tn}
Sets of graphs (mining for frequent subgraphs):
{g , g , …, g }
1 2 n
Mining structured patterns in XML documents, bio-
April 18, 2013 structures, etc.
chemical Data Mining: Concepts and Techniques
30
31. Episodes and Episode Pattern Mining
Other methods for specifying the kinds of patterns
Serial episodes: A → B
Parallel episodes: A & B
Regular expressions: (A | B)C*(D → E)
Methods for episode pattern mining
Variations of Apriori-like algorithms, e.g., GSP
Database projection-based pattern growth
Similar to the frequent pattern growth without
candidate generation
April 18, 2013 Data Mining: Concepts and Techniques
31
32. Periodicity Analysis
Periodicity is everywhere: tides, seasons, daily power
consumption, etc.
Full periodicity
Every point in time contributes (precisely or
approximately) to the periodicity
Partial periodicit: A more general notion
Only some segments contribute to the periodicity
Jim reads NY Times 7:00-7:30 am every week day
Cyclic association rules
Associations which form cycles
Methods
Full periodicity: FFT, other statistical analysis methods
Partial and cyclic periodicity: Variations of Apriori-like
mining methods
April 18, 2013 Data Mining: Concepts and Techniques
32
33. Ref: Mining Sequential
Patterns
R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance
improvements. EDBT’96.
H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event
sequences. DAMI:97.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine
Learning, 2001.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential
Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01 (TKDE’04).
J. Pei, J. Han and W. Wang, Constraint-Based Sequential Pattern Mining in Large
Databases, CIKM'02.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large
Datasets. SDM'03.
J. Wang and J. Han, BIDE: Efficient Mining of Frequent Closed Sequences, ICDE'04.
H. Cheng, X. Yan, and J. Han, IncSpan: Incremental Mining of Sequential Patterns in
Large Database, KDD'04.
J. Han, G. Dong and Y. Yin, Efficient Mining of Partial Periodic Patterns in Time Series
Database, ICDE'99.
J. Yang, W. Wang, and P. S. Yu, Mining asynchronous periodic patterns in time series
data, KDD'00.
April 18, 2013 Data Mining: Concepts and Techniques
33