The document summarizes the LASH algorithm for mining sequential patterns from sequence data with hierarchies. LASH extends traditional sequential pattern mining to handle hierarchies among items. It first defines how sequences can be generalized based on item hierarchies. It then partitions the sequence database based on the most frequent items and mines generalized patterns within each partition. Key steps include identifying relevant items, generalizing sequences, and representing equivalent sequences compactly to efficiently find all frequent generalized sequences satisfying maximum length and gap constraints.
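3. Introduction
Sequential pattern mining is used in many areas, such as market-basket analysis, web usage mining, and language modeling.
Some items belong to a hierarchy, and an item's frequency can differ from that of its ancestors.
Ex) [Figure: item hierarchy with Photography at the top, Analog Camera and Digital Camera below it, and Canon and Nikon at the bottom. Photography is frequent; the individual brands are not.]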
4. Related Work
MG-FSM, a state-of-the-art frequent sequence miner, was proposed at SIGMOD 2013, but it does not support hierarchies.
Other sequential pattern mining algorithms:
BFS: Apriori, GSP, SPADE, ...
DFS: FP-Growth, PrefixSpan, SPAM, BIDE, GAP-BIDE, ...
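6. Hierarchies
In GSM, the vocabulary $W$ is arranged in a hierarchy.
For $u, v \in W$:
if $u$ directly generalizes to $v$, we write $u \to v$;
if $u$ generalizes to $v$ in zero or more steps (so every item generalizes to itself), we write $u \to^{*} v$.
[Figure: example hierarchy with root $B$; children $b_1$, $b_2$, $b_3$; and the leaves $b_{11}$, $b_{12}$, $b_{13}$ used in the later examples under $b_1$.]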
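14. Preprocess
In the preprocessing phase, build the f-list and a total order on the frequent items:
$w_1 < w_2$ when $f_0(w_1, D) > f_0(w_2, D)$
An ancestor is smaller than its descendants.
Database $D$:
$T_1$: $a\ b_1\ a\ b_1$
$T_2$: $a\ b_3\ c\ c\ b_2$
$T_3$: $a\ c$
$T_4$: $b_{11}\ a\ e\ a$
$T_5$: $a\ b_{12}\ d_1\ c$
$T_6$: $b_{13}\ f\ d_2$
f-list (items with frequency ≥ $\sigma = 2$):
$a$ : 5
$B$ : 5
$b_1$ : 4
$c$ : 3
$D$ : 2
Total order: $a < B < b_1 < c < D$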
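The preprocessing step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `PARENT` map is an assumption inferred from the example counts, the frequency counted is the number of transactions that contain an item or one of its descendants, and the tie-break between equally frequent, unrelated items (here alphabetical, ignoring case) is a hypothetical choice made only to reproduce the order on the slide.

```python
from collections import Counter

# Hypothetical hierarchy (child -> parent), inferred from the example counts.
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}

DATABASE = [
    ["a", "b1", "a", "b1"],        # T1
    ["a", "b3", "c", "c", "b2"],   # T2
    ["a", "c"],                    # T3
    ["b11", "a", "e", "a"],        # T4
    ["a", "b12", "d1", "c"],       # T5
    ["b13", "f", "d2"],            # T6
]


def ancestors_or_self(item):
    """Return the item followed by all of its ancestors, closest first."""
    chain = []
    while item is not None:
        chain.append(item)
        item = PARENT.get(item)
    return chain


def build_f_list(database, sigma):
    """Count, for every item, the transactions containing it or a descendant,
    keep items with count >= sigma, and sort them into the total order."""
    counts = Counter()
    for seq in database:
        seen = set()
        for item in seq:
            seen.update(ancestors_or_self(item))
        counts.update(seen)  # each transaction counts an item at most once
    frequent = {w: n for w, n in counts.items() if n >= sigma}

    def sort_key(w):
        depth = len(ancestors_or_self(w)) - 1
        # Higher frequency first; on ties an ancestor (smaller depth) comes
        # before its descendants; the final alphabetical tie-break is an
        # assumption, not part of the slides.
        return (-frequent[w], depth, w.lower())

    return [(w, frequent[w]) for w in sorted(frequent, key=sort_key)]


F_LIST = build_f_list(DATABASE, sigma=2)
TOTAL_ORDER = [w for w, _ in F_LIST]
```

On the example database this yields the f-list and total order shown on the slide.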
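15. Semi-Naïve Algorithm
Generate a subsequence only if all of its items are frequent.
Ex) $T_4 = b_{11}\ a\ e\ a$
$G_{\lambda=3,\gamma=1}(T_4) = \{aa,\ b_1 a,\ b_1 aa,\ Ba,\ Baa\}$
f-list ($\sigma = 2$): $a$ : 5, $B$ : 5, $b_1$ : 4, $c$ : 3, $D$ : 2
Some generalized subsequences $S \sqsubseteq_1 T_4$: $b_{11} a$, $b_1 aa$, $Ba$, $Be$, $aa$, ...
Those that contain an infrequent item (such as $b_{11}$ or $e$) are not generated.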
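The generation step for a single transaction can be sketched as a depth-first enumeration. This is an illustrative sketch under assumptions: the `PARENT` map and the `FREQUENT` set are taken from the running example, the gap constraint is read as "at most $\gamma$ items between two consecutive matched items", and the function name is hypothetical.

```python
# Hypothetical hierarchy (child -> parent) from the running example.
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
FREQUENT = {"a", "B", "b1", "c", "D"}  # items on the f-list


def frequent_generalizations(item):
    """The item and its ancestors, restricted to frequent items."""
    out = []
    while item is not None:
        if item in FREQUENT:
            out.append(item)
        item = PARENT.get(item)
    return out


def generalized_subsequences(seq, max_len, max_gap):
    """All generalized subsequences of seq with 2 <= length <= max_len,
    at most max_gap skipped items between neighbours, frequent items only."""
    options = [frequent_generalizations(item) for item in seq]
    results = set()

    def extend(last_pos, prefix):
        if len(prefix) >= 2:
            results.add(prefix)
        if len(prefix) == max_len:
            return
        # The next matched position may skip at most max_gap items.
        upper = min(len(seq), last_pos + max_gap + 2)
        for nxt in range(last_pos + 1, upper):
            for g in options[nxt]:
                extend(nxt, prefix + (g,))

    for start in range(len(seq)):
        for g in options[start]:
            extend(start, (g,))
    return results


T4 = ["b11", "a", "e", "a"]
G_T4 = generalized_subsequences(T4, max_len=3, max_gap=1)
```

For $T_4$ this produces the five sequences listed on the slide; $e$ never appears because it has no frequent generalization.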
16. Partition
Total order: $a < B < b_1 < c < D$ ($a$ is the most frequent)
$p(S) = \max_{w \in S} w$ is the pivot item of $S$ (its largest item in the total order).
Ex) $T_1 = a\ b_1\ a\ b_1$, $p(T_1) = b_1$
A partition $P_w$ is the set of sequences that have $w$ as their pivot.
Ex) $T_1 \in P_{b_1}$, $a \in P_a$, $aa \in P_a$, ...
From $P_w$, mine all generalized sequences that contain $w$ but no larger item (in the total order).
Ex) $P_a$ consists of $a$'s only; $P_B$ consists of $a$'s and $B$'s.
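17. Partition
Total order: $a < B < b_1 < c < D$ ($a$ is the most frequent)
$G_{\lambda=3,\gamma=1}(T_4) = \{aa,\ b_1 a,\ b_1 aa,\ Ba,\ Baa\}$
$P_a$: $aa$ (allowed items: $a$)
$P_B$: $Ba$, $Baa$ (allowed items: $a$, $B$)
$P_{b_1}$: $b_1 a$, $b_1 aa$ (allowed items: $a$, $B$, $b_1$)
$P_c$: none (allowed items: $a$, $B$, $b_1$, $c$)
$P_D$: none (allowed items: $a$, $B$, $b_1$, $c$, $D$)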
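The pivot and the partitioning of $G_{\lambda=3,\gamma=1}(T_4)$ can be reproduced with a short sketch. The total order is the one from the slides; the function names are hypothetical, and the input set is copied from the example rather than computed.

```python
from collections import defaultdict

TOTAL_ORDER = ["a", "B", "b1", "c", "D"]              # most frequent first
RANK = {w: i for i, w in enumerate(TOTAL_ORDER)}


def pivot(seq):
    """Largest item of seq in the total order."""
    return max(seq, key=RANK.__getitem__)


def partition_by_pivot(sequences):
    """Group sequences into partitions keyed by their pivot item."""
    parts = defaultdict(set)
    for s in sequences:
        parts[pivot(s)].add(s)
    return dict(parts)


# G_{lambda=3, gamma=1}(T4), copied from the slide.
G_T4 = {
    ("a", "a"),
    ("b1", "a"),
    ("b1", "a", "a"),
    ("B", "a"),
    ("B", "a", "a"),
}
PARTS = partition_by_pivot(G_T4)
```

Partitions for $c$ and $D$ do not appear in `PARTS` because $T_4$ contributes no sequence with those pivots.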
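18. w-equivalency
Two sequences $T$ and $T'$ are $w$-equivalent if $G_{w,\lambda,\gamma}(T) = G_{w,\lambda,\gamma}(T')$, where
$G_{w,\lambda,\gamma}(T) = \{\, S \mid S \sqsubseteq_{\gamma} T,\ 2 \le |S| \le \lambda,\ p(S) = w \,\}$
Total order: $a < B < b_1 < c < D$
Ex) $T_4 = b_{11}\ a\ e\ a$
$G_{w=B,\lambda=3,\gamma=1}(T_4) = \{Baa,\ Ba\} = G_{w=B,\lambda=3,\gamma=1}(Baa)$
$P_a$: $aa$
$P_B$: $Ba$, $Baa$
$P_{b_1}$: $b_1 a$, $b_1 aa$
$P_c$:
$P_D$: Not necessary!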
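The definition can be checked by brute force on the example. The sketch below enumerates position tuples directly instead of growing sequences, which is slow but mirrors the set-builder definition. It assumes that only frequent items (those in the total order) may appear in $S$, and it uses a reduced, hypothetical `PARENT` map that covers just the items needed here.

```python
from itertools import combinations, product

PARENT = {"b11": "b1", "b1": "B"}                     # hypothetical, minimal
RANK = {w: i for i, w in enumerate(["a", "B", "b1", "c", "D"])}


def generalizations(item):
    """The item and its ancestors, restricted to items in the total order."""
    out = []
    while item is not None:
        if item in RANK:
            out.append(item)
        item = PARENT.get(item)
    return out


def pivot_sequences(seq, w, max_len, max_gap):
    """G_{w,lambda,gamma}(seq): generalized subsequences with pivot w."""
    result = set()
    for k in range(2, max_len + 1):
        for pos in combinations(range(len(seq)), k):
            # Reject position tuples that skip more than max_gap items.
            if any(q - p - 1 > max_gap for p, q in zip(pos, pos[1:])):
                continue
            for cand in product(*(generalizations(seq[p]) for p in pos)):
                if max(cand, key=RANK.__getitem__) == w:
                    result.add(cand)
    return result


def w_equivalent(t1, t2, w, max_len, max_gap):
    """True if both sequences yield the same pivot sequences for w."""
    return (pivot_sequences(t1, w, max_len, max_gap)
            == pivot_sequences(t2, w, max_len, max_gap))


T4 = ["b11", "a", "e", "a"]
BAA = ["B", "a", "a"]
```

With these definitions `w_equivalent(T4, BAA, "B", 3, 1)` holds, so under the slide's definition the shorter sequence $Baa$ can stand in for $T_4$ in partition $P_B$.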
19. w-generalization
An item $w'$ is $w$-relevant if $w' \le w$ (it is at least as frequent as $w$). To generalize a sequence with respect to pivot $w$:
1) replace each irrelevant item that has a $w$-relevant ancestor by that ancestor;
2) replace each irrelevant item that has no $w$-relevant ancestor by the blank symbol $\sqcup$.
Ex) $a < B < b_1 < c < D$ (pivot $B$)
$T_2 = a\ b_3\ c\ c\ b_2 \to^{*} T_2' = a\ B \sqcup \sqcup B$ with respect to pivot $B$
Step by step:
$T_2$: $a\ b_3\ c\ c\ b_2$
1) $a\ B\ c\ c\ b_2$
1) $a\ B\ c\ c\ B$
2) $a\ B \sqcup c\ B$
2) $T_2' = a\ B \sqcup \sqcup B$
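The rewriting rules translate almost directly into code. This is a sketch under assumptions: the `PARENT` map is inferred from the running example, and when an item has several relevant ancestors the closest one is chosen, which the slides do not state explicitly.

```python
# Hypothetical hierarchy (child -> parent) from the running example.
PARENT = {
    "b1": "B", "b2": "B", "b3": "B",
    "b11": "b1", "b12": "b1", "b13": "b1",
    "d1": "D", "d2": "D",
}
RANK = {w: i for i, w in enumerate(["a", "B", "b1", "c", "D"])}
BLANK = "⊔"


def w_generalize(seq, w):
    """Rewrite seq for pivot w: keep relevant items, lift irrelevant items
    to their closest relevant ancestor, and blank out everything else."""
    limit = RANK[w]
    out = []
    for item in seq:
        cur = item
        # Walk up the hierarchy until a w-relevant item is found.
        while cur is not None and not (cur in RANK and RANK[cur] <= limit):
            cur = PARENT.get(cur)
        out.append(cur if cur is not None else BLANK)
    return out


T2 = ["a", "b3", "c", "c", "b2"]
T2_PRIME = w_generalize(T2, "B")
```

For pivot $B$ the items $b_3$ and $b_2$ are lifted to $B$, and both occurrences of $c$ become blanks, matching $T_2'$ on the slide.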
21. Proposed Algorithm
For each transaction $T_i$:
generate the generalized sequence $T_i'$ with respect to each frequent item $f_j$ as pivot;
send each $T_i'$ to the partition of its pivot.
Then mine each partition locally.
22. Pivot Sequence Miner
Local mining can be done efficiently with PSM (Pivot Sequence Miner) instead of Apriori-style miners (BFS or DFS).
Instead of searching for every frequent sequence, LASH efficiently enumerates only the sequences that contain the pivot.
Ex) pivot $c$: {abc, cab, ...}
There is no need to find {ab}, because it does not contain $c$.
27. Conclusion
LASH is the first parallel algorithm for mining frequent sequences with hierarchies.
LASH partitions the sequences by pivot item and mines each partition locally with PSM.
Thanks to PSM, LASH can search more efficiently than MG-FSM (the state-of-the-art frequent sequence miner without hierarchies).
Editor's Notes
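The goal is to find recurring patterns.
Market-basket analysis: it is useful to discover facts such as "customers who buy item A also buy item B."
Web usage mining: we can discover facts such as "after visiting site A, users always visit site B."
Language model: we can find the words that occur together with a given word, related words, and so on.
As in the examples above, the technique is used to find recurring patterns in a collection of documents.
In addition, some items have a structure like the one in the figure: even if each specific item appears rarely, its parent may appear often.
So what we want to discover is, for example, "customers who buy item C rarely buy a Canon, but they often buy something from photography."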
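MG-FSM was fast, but it cannot take a structure like the one in the figure into account.
Apart from MapReduce algorithms that run on multiple machines, there are many single-machine algorithms, in both BFS style and DFS style.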
The problem:
Given that consecutive items may be at most $\gamma$ positions apart,
find the patterns of length at most $\lambda$
that occur at least $\sigma$ times.
The naive approach generates every possible subsequence for each transaction,
counts each of them,
and reports those that occur at least $\sigma$ times.
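The semi-naive algorithm and LASH both need a preprocessing step:
scan the full data once to count how often each item appears,
then keep only the items that appear at least $\sigma$ times and sort them to build an order.
The naive approach generated every case for each transaction,
but now it is enough to generate and count only the combinations made of frequent items.
This works because a sequence that contains an infrequent item cannot be frequent itself (the Apriori rule).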
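LASH goes one step further and divides the sequences generated from each transaction into partitions.
The pivot is the item with the lowest frequency in a sequence.
A partition is the set of sequences that share the same pivot.
The earlier semi-naive example is partitioned as shown.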
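Suppose the transactions collected in each partition follow a distribution like this one.
(It does not have to be Gaussian, but some partitions will be large and others small.)
If we group the small partitions together on one machine, the load should end up balanced.
As in the figure, we want the number of transactions on each machine to be similar.
Stated as a problem:
if the total amount of sequences in partition $i$ (of $n$ partitions) is $c_i$, distribute the partitions over $k$ machines.
The objective is to minimize the maximum total amount of sequences assigned to any machine (minimize the weight of the heaviest machine).
This is the multiprocessor scheduling problem.
To solve it,
we make a greedy choice: take the partitions from heaviest to lightest and put each one on the currently lightest machine.
This is an approximation algorithm with the bound shown on the slide.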
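The greedy choice described above is the classic longest-processing-time-first (LPT) heuristic, and it can be sketched with a min-heap of machine loads. The partition names and cost values below are made-up numbers for illustration only, and the function name is hypothetical.

```python
import heapq


def lpt_assign(costs, k):
    """Assign partitions to k machines: heaviest partition first,
    always onto the machine with the smallest current load."""
    heap = [(0, m) for m in range(k)]      # (current load, machine index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(k)]
    by_cost = sorted(costs.items(), key=lambda kv: kv[1], reverse=True)
    for part, cost in by_cost:
        load, m = heapq.heappop(heap)      # lightest machine so far
        assignment[m].append(part)
        heapq.heappush(heap, (load + cost, m))
    loads = [0] * k
    for load, m in heap:
        loads[m] = load
    return assignment, loads


# Made-up cost estimates c_i for the five partitions of the running example.
COSTS = {"P_a": 7, "P_B": 5, "P_b1": 4, "P_c": 3, "P_D": 1}
ASSIGNMENT, LOADS = lpt_assign(COSTS, k=2)
```

With these illustrative costs both machines end up with a load of 10. In general the result is only approximately balanced, which is why the notes call it an approximation algorithm.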
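We need to estimate the cost of each partition,
and that is not easy either: how can we know how many sequences will land in a partition before actually counting them?
Treated as a probability, the cost can be computed as shown on the slide.
Each item of a sequence of length L is drawn from the set of frequent items, with a probability equal to its frequency.
By computing the probability that a sequence belonging to each partition is generated in this way,
we can roughly estimate how much data each partition will hold, and use that estimate as the cost for assigning partitions with the LPT algorithm.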
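When sending data to a partition, it is better to keep only the information that partition needs and to cut the rest or replace it with blanks (this also compresses well).
w-equivalency and w-generalization describe which information should be kept.
As stated on the slide, knowing that a pivot exists allows a better search than existing single-machine frequent sequence miners; this is what PSM does.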