SlideShare a Scribd company logo
1 of 29
Download to read offline
Eirini Ntoutsi1,2, Myra Spiliopoulou3, Yannis Theodoridis1
1 Institute for Informatics, LMU, Germany
2 Dept of Informatics, Uni of Piraeus, Greece
3 School of Computer Science, Uni of Magdeburg, Germany
ICCSA 2011, Santander
Irene Ntoutsi ICCSA’11
 Motivation
 The evolution graph
 The FINGERPRINT of evolution
 Experiments
 Conclusions and outlook
Summarizing Cluster Evolution in Dynamic Environments 2
Irene Ntoutsi ICCSA’11
 More and more data are produced nowadays:
 Telcos, Banks, Health care systems, Retail industry, WWW …
 Modern data are dynamic
 A special category is data streams: possible infinite sequence of
elements arriving at a rapid rate
 Data Mining over such kind of data is even more challenging:
 Huge amounts of data  only a small amount can be stored in memory
 Arrival at a rapid rate  need for fast response time
 The generative distribution of the stream might change over time 
adapt and report on changes
e1 … …e2 e3 e4 ene5 e6 e7 e8 e9 e10
Time
Summarizing Cluster Evolution in Dynamic Environments 3
Irene Ntoutsi ICCSA’11
 Traditionally clustering is applied over static data
 Lately there are approaches that deal with modern data
Summarizing Cluster Evolution in Dynamic Environments
Adapt clusters to reflect current
state of the population.
• CluStream [Aggrawal et al, VLDB’03]
• DenStream [Cao et al, SDM’06]
• Dstream [Chen and Tu, KDD’07]
Trace changes and reason on them
so as to gain insights on the
population.
• FOCUS [Ganti et al, PODS’99]
• PANDA [Bartolini et al, KDE’09]
• MONIC [Spiliopoulou et al, KDD’06]
Dynamic Clustering
Approaches
Adaptation Tracing & Understanding
4
Irene Ntoutsi ICCSA’11
 Although there exist methods for:
 online cluster adaptation as the stream proceeds and
 change detection between clusterings extracted at different time points
 they do not deal with the efficient long-term maintenance of
the changes over an infinite stream of data
 To this end, we propose:
i. A graph representation of cluster changes/ transitions, and
ii. methods for condensing this graph into a FINGERPRINT
 The FINGERPRINT is a summary structure where similar clusters
are efficiently summarized, subject to an information loss
function.
Summarizing Cluster Evolution in Dynamic Environments 5
Irene Ntoutsi ICCSA’11
 Motivation
 The evolution graph
 The FINGERPRINT of evolution
 Experiments
 Conclusions and outlook
Summarizing Cluster Evolution in Dynamic Environments 6
Irene Ntoutsi ICCSA’11
 We consider a period of observation t1, t2, …, tm, …
 New records arrive over time and old records are subject to ageing according to a
window size parameter.
 Under these settings, we create at each time point t the dataset Dt
 At each t, we get a clustering Clt
 Clt might be the result of i) complete reclustering at t or ii) cluster adaptation from Clt-1
 Clustering evolution upon consecutive time points Clt-1, Clt is monitored
Summarizing Cluster Evolution in Dynamic Environments
…
d1
t1 t2 t3 tm
…
Datasets
d2 d3 dm
ClusteringsCl2 Cl3 Clm…Cl1
The monitoring process (window size = 2)
? ? ? ?
D2 D3 Dm
7
Irene Ntoutsi ICCSA’11

Summarizing Cluster Evolution in Dynamic Environments
C11
C21
C31
C12
C22
C23
C32
C41
C32
C13
C24
C32 C33
t_1 t_2 t_3 t_4
8
Irene Ntoutsi ICCSA’11

Summarizing Cluster Evolution in Dynamic Environments 9
Irene Ntoutsi ICCSA’11

Summarizing Cluster Evolution in Dynamic Environments 10
Irene Ntoutsi ICCSA’11
(External) transitions of cluster X in clustering Cl1 towards Cl2:
 survival
The best match of X in Cl2 is not a match for any other cluster in Cl1.
 absorption
There is a Y in Cl2 that is a match for X AND for one more cluster in Cl1.
 split
There are Y[1],…,Y[p] in Cl2 that together match X
AND the overlap of each one with X is at least τsplit.
 disappearance
X is not absorbed and not split and has not survived.
AND
 new cluster appearance
Υ in Cl2 is not involved in the external transitions of any X.
YX →
⊂
YX →
][],...,1[ pYYX →⊂
⊗→X
Y→⊗
Summarizing Cluster Evolution in Dynamic Environments 11
Irene Ntoutsi ICCSA’11
 EG is built incrementally as new clusterings arrive at t1, t2, …
 Whenever a new clustering Cli arrives at ti:
 Clusters of Cli are added as nodes to the EG and their labels are
computed
 We detect the cluster transitions w.r.t. Cli-1 and an edge is added to the
EG for each detected transition between clusters in Cli-1 and Cli
 Bookkeeping: The cluster members in Cli-1 are discarded, whereas the
members of the clusters in Cli are retained till the next time point ti+1
▪ We need these members to decide latter on the cluster transitions between
Cli and Cli+1
Summarizing Cluster Evolution in Dynamic Environments 12
Irene Ntoutsi ICCSA’11
 Motivation
 The evolution graph
 The FINGERPRINT of evolution
 Experiments
 Conclusions and outlook
Summarizing Cluster Evolution in Dynamic Environments 13
Irene Ntoutsi ICCSA’11
 We summarize EG so as cluster transitions are reflected but redundancies
are omitted.
 To this end, we summarize traces (sequences of cluster survivals) into some
condensed form, the fingerprint of the trace.
 For each emerged cluster c that appeared for the first time at t (i.e. a cluster
with no incoming edges at t), we define its cluster trace as a sequence of
cluster survivals:
trace(c) = <c1, c2, …, cm>
 First, we introduce the virtual center as the summary of a (sub)trace
 Let trace(c) = <c1, c2, …, cm>. Let X = <cj, …, cj+k> be a subtrace of it. The virtual center
of X is a derived node composed of the averages of the labels of the nodes in X:
where [i] is the i-th dimension
 To indicate that a cluster c has been mapped to a virtual center, we use the notation
Summarizing Cluster Evolution in Dynamic Environments 14
Irene Ntoutsi ICCSA’11

Summarizing Cluster Evolution in Dynamic Environments 15
Irene Ntoutsi ICCSA’11
 Information Loss
 Space Reduction
Summarizing Cluster Evolution in Dynamic Environments 16
Irene Ntoutsi ICCSA’1117

Irene Ntoutsi ICCSA’11
 Once the traces are summarized into
fingerprints, the evolution graph can be also
summarized
 We propose 2 summarization strategies:
 Incremental summarization of the graph
 Batch summarization of the graph
Summarizing Cluster Evolution in Dynamic Environments 18
Irene Ntoutsi ICCSA’11
 The traces are summarized incrementally as new clusterings
arrive over time
 If a new clustering arrives, we check whether there is some
cluster survival from the previous timepoint.
 Let x be a cluster that survives into a latter cluster y.
 If < τ, y is not added to the graph. Rather, x and y are
summarized into a virtual center v and x is replaced in the graph by v.
 Otherwise, the node y and the edge (x,y) are added to the graph
Summarizing Cluster Evolution in Dynamic Environments 19
Irene Ntoutsi ICCSA’11
 The summarization is performed over the whole trace based
on two heuristics:
 Heuristic A (deals with the violation of C2):
If T contains adjacent nodes that are in larger distance than τ from each
other split T as follows: detect the pair (c1, c2) with the largest distance
and split T into T1, T2 such that c1 is the last node of T1 and c2 is the first
node of T2
 Heuristic B (deals with the violation of C1):
If T satisfies C2 but contains nodes that are in larger distance than τ
from the virtual center vCenter(T), split T as follows: the node c that
has the maximum distance to vCenter(T) is detected and T is
partitioned into T1, T2 such that c is the last node of T1 and its
successor is the first node of T2
Summarizing Cluster Evolution in Dynamic Environments 20
Irene Ntoutsi ICCSA’11
 Motivation
 The evolution graph
 The FINGERPRINT of evolution
 Experiments
 Conclusions and outlook
Summarizing Cluster Evolution in Dynamic Environments 21
Irene Ntoutsi ICCSA’11
 We experiments with 3 datasets
 The Network Intrusion dataset: contains TCP connection logs from 2
weeks of LAN network traffic
▪ Numerical dataset
▪ Rapidly evolving
 The Charitable Donation dataset: contains information on people who
have made charitable donations in response to direct mailings
▪ Numerical dataset
▪ Relatively stable
 The ACM H2.8 dataset: the set of documents inserted in between 1997
and 2004 in the ACM Digital Library, category H2.8 on Database
Applications
▪ Text dataset
▪ Evolves in an unbalanced way
Summarizing Cluster Evolution in Dynamic Environments 22
Irene Ntoutsi ICCSA’11
 In 1998 we observe a new cluster on Information Systems which survives till
2000.
 The batch FINGERPRINT, condenses this trace into a single virtual center in 1
step:
 The incremental FINGERPRINT does the same in 2 steps:
 First summarizes c1998_2 and c1999_6 into a virtual center v0
 Then summarizes v0 and c2000_3 into a new virtual center
Summarizing Cluster Evolution in Dynamic Environments 23
Network intrusion dataset Charitable donation dataset
Summarizing Cluster Evolution in Dynamic Environments 24
Impact of threshold τ on space reduction
Network intrusion dataset Charitable donation dataset
Summarizing Cluster Evolution in Dynamic Environments 25
Impact of threshold τ on information loss
Irene Ntoutsi ICCSA’11
 Motivation
 The evolution graph
 The FINGERPRINT of evolution
 Experiments
 Conclusions and outlook
Summarizing Cluster Evolution in Dynamic Environments 26
Irene Ntoutsi ICCSA’11
 We presented the FINGERPRINT framework for summarizing
cluster evolution in a dynamic environment subject to
information loss and space reduction criteria
 Batch FINGERPRINT has better quality but requires the whole
graph as input. Some hybrid method might be interesting
 So far we summarize only cluster survivals. What about splits
and absorptions?
 The impact of clustering quality on the summarization
Summarizing Cluster Evolution in Dynamic Environments 27
Irene Ntoutsi ICCSA’11Summarizing Cluster Evolution in Dynamic Environments
Thank you
for your attention!
For further questions please contact me at:
ntoutsi@dbs.ifi.lmu.de
28
Irene Ntoutsi ICCSA’11
The speaker‘s attendance at this conference was sponsored by
the Alexander von Humboldt Foundation
http://www.humboldt-foundation.de
Summarizing Cluster Evolution in Dynamic Environments 29

More Related Content

Similar to Summarizing Cluster Evolution in Dynamic Environments

Distributed Beamforming in Sensor Networks
Distributed Beamforming in Sensor NetworksDistributed Beamforming in Sensor Networks
Distributed Beamforming in Sensor Networks
Daniel Tai
 
Transport and routing on coupled spatial networks
Transport and routing on coupled spatial networksTransport and routing on coupled spatial networks
Transport and routing on coupled spatial networks
richardgmorris
 
Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...
Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...
Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...
Mohamed Seif
 
Dynamic Data Community Discovery
Dynamic Data Community DiscoveryDynamic Data Community Discovery
Dynamic Data Community Discovery
Sarang Rakhecha
 
A novel approach for high speed convolution of finite and infinite length seq...
A novel approach for high speed convolution of finite and infinite length seq...A novel approach for high speed convolution of finite and infinite length seq...
A novel approach for high speed convolution of finite and infinite length seq...
eSAT Journals
 

Similar to Summarizing Cluster Evolution in Dynamic Environments (20)

Traveling Salesman Problem in Distributed Environment
Traveling Salesman Problem in Distributed EnvironmentTraveling Salesman Problem in Distributed Environment
Traveling Salesman Problem in Distributed Environment
 
TRAVELING SALESMAN PROBLEM IN DISTRIBUTED ENVIRONMENT
TRAVELING SALESMAN PROBLEM IN DISTRIBUTED ENVIRONMENTTRAVELING SALESMAN PROBLEM IN DISTRIBUTED ENVIRONMENT
TRAVELING SALESMAN PROBLEM IN DISTRIBUTED ENVIRONMENT
 
A Novel Design Architecture of Secure Communication System with Reduced-Order...
A Novel Design Architecture of Secure Communication System with Reduced-Order...A Novel Design Architecture of Secure Communication System with Reduced-Order...
A Novel Design Architecture of Secure Communication System with Reduced-Order...
 
Clustering Financial Time Series using their Correlations and their Distribut...
Clustering Financial Time Series using their Correlations and their Distribut...Clustering Financial Time Series using their Correlations and their Distribut...
Clustering Financial Time Series using their Correlations and their Distribut...
 
Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...Graph Spectra through Network Complexity Measures: Information Content of Eig...
Graph Spectra through Network Complexity Measures: Information Content of Eig...
 
Comparing ICH-leach and Leach Descendents on Image Transfer using DCT
Comparing ICH-leach and Leach Descendents on Image Transfer using DCT Comparing ICH-leach and Leach Descendents on Image Transfer using DCT
Comparing ICH-leach and Leach Descendents on Image Transfer using DCT
 
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKSEVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
EVOLUTIONARY CENTRALITY AND MAXIMAL CLIQUES IN MOBILE SOCIAL NETWORKS
 
Unit 4.pdf
Unit 4.pdfUnit 4.pdf
Unit 4.pdf
 
Distributed Beamforming in Sensor Networks
Distributed Beamforming in Sensor NetworksDistributed Beamforming in Sensor Networks
Distributed Beamforming in Sensor Networks
 
Mining of time series data base using fuzzy neural information systems
Mining of time series data base using fuzzy neural information systemsMining of time series data base using fuzzy neural information systems
Mining of time series data base using fuzzy neural information systems
 
Project PPT
Project PPTProject PPT
Project PPT
 
Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data Cluster formation over huge volatile robotic data
Cluster formation over huge volatile robotic data
 
Transport and routing on coupled spatial networks
Transport and routing on coupled spatial networksTransport and routing on coupled spatial networks
Transport and routing on coupled spatial networks
 
Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...
Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...
Achievable Degrees of Freedom of the K-user MISO Broadcast Channel with Alter...
 
Normalization cross correlation value of
Normalization cross correlation value ofNormalization cross correlation value of
Normalization cross correlation value of
 
Continuous and Discrete Crooklet Transform
Continuous and Discrete Crooklet TransformContinuous and Discrete Crooklet Transform
Continuous and Discrete Crooklet Transform
 
Dynamic Data Community Discovery
Dynamic Data Community DiscoveryDynamic Data Community Discovery
Dynamic Data Community Discovery
 
Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03Digital Signal Processing[ECEG-3171]-Ch1_L03
Digital Signal Processing[ECEG-3171]-Ch1_L03
 
A novel approach for high speed convolution of finite and infinite length seq...
A novel approach for high speed convolution of finite and infinite length seq...A novel approach for high speed convolution of finite and infinite length seq...
A novel approach for high speed convolution of finite and infinite length seq...
 
THE KEY EXCHANGE CRYPTOSYSTEM USED WITH HIGHER ORDER DIOPHANTINE EQUATIONS
THE KEY EXCHANGE CRYPTOSYSTEM USED WITH HIGHER ORDER DIOPHANTINE EQUATIONSTHE KEY EXCHANGE CRYPTOSYSTEM USED WITH HIGHER ORDER DIOPHANTINE EQUATIONS
THE KEY EXCHANGE CRYPTOSYSTEM USED WITH HIGHER ORDER DIOPHANTINE EQUATIONS
 

More from Eirini Ntoutsi

More from Eirini Ntoutsi (11)

AAISI AI Colloquium 30/3/2021: Bias in AI systems
AAISI AI Colloquium 30/3/2021: Bias in AI systemsAAISI AI Colloquium 30/3/2021: Bias in AI systems
AAISI AI Colloquium 30/3/2021: Bias in AI systems
 
Fairness-aware learning: From single models to sequential ensemble learning a...
Fairness-aware learning: From single models to sequential ensemble learning a...Fairness-aware learning: From single models to sequential ensemble learning a...
Fairness-aware learning: From single models to sequential ensemble learning a...
 
Bias in AI-systems: A multi-step approach
Bias in AI-systems: A multi-step approachBias in AI-systems: A multi-step approach
Bias in AI-systems: A multi-step approach
 
Sentiment Analysis of Social Media Content: A multi-tool for listening to you...
Sentiment Analysis of Social Media Content: A multi-tool for listening to you...Sentiment Analysis of Social Media Content: A multi-tool for listening to you...
Sentiment Analysis of Social Media Content: A multi-tool for listening to you...
 
A Machine Learning Primer,
A Machine Learning Primer,A Machine Learning Primer,
A Machine Learning Primer,
 
(Machine)Learning with limited labels(Machine)Learning with limited labels(Ma...
(Machine)Learning with limited labels(Machine)Learning with limited labels(Ma...(Machine)Learning with limited labels(Machine)Learning with limited labels(Ma...
(Machine)Learning with limited labels(Machine)Learning with limited labels(Ma...
 
Redundancies in Data and their E ect on the Evaluation of Recommendation Syst...
Redundancies in Data and their Eect on the Evaluation of Recommendation Syst...Redundancies in Data and their Eect on the Evaluation of Recommendation Syst...
Redundancies in Data and their E ect on the Evaluation of Recommendation Syst...
 
Density-based Projected Clustering over High Dimensional Data Streams
Density-based Projected Clustering over High Dimensional Data StreamsDensity-based Projected Clustering over High Dimensional Data Streams
Density-based Projected Clustering over High Dimensional Data Streams
 
Density Based Subspace Clustering Over Dynamic Data
Density Based Subspace Clustering Over Dynamic DataDensity Based Subspace Clustering Over Dynamic Data
Density Based Subspace Clustering Over Dynamic Data
 
Discovering and Monitoring Product Features and the Opinions on them with OPI...
Discovering and Monitoring Product Features and the Opinions on them with OPI...Discovering and Monitoring Product Features and the Opinions on them with OPI...
Discovering and Monitoring Product Features and the Opinions on them with OPI...
 
NeeMo@OpenCoffeeXX
NeeMo@OpenCoffeeXXNeeMo@OpenCoffeeXX
NeeMo@OpenCoffeeXX
 

Recently uploaded

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Krashi Coaching
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
ciinovamais
 

Recently uploaded (20)

social pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajansocial pharmacy d-pharm 1st year by Pragati K. Mahajan
social pharmacy d-pharm 1st year by Pragati K. Mahajan
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Activity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdfActivity 01 - Artificial Culture (1).pdf
Activity 01 - Artificial Culture (1).pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 

Summarizing Cluster Evolution in Dynamic Environments

  • 1. Eirini Ntoutsi1,2, Myra Spiliopoulou3, Yannis Theodoridis1 1 Institute for Informatics, LMU, Germany 2 Dept of Informatics, Uni of Piraeus, Greece 3 School of Computer Science, Uni of Magdeburg, Germany ICCSA 2011, Santander
  • 2. Irene Ntoutsi ICCSA’11  Motivation  The evolution graph  The FINGERPRINT of evolution  Experiments  Conclusions and outlook Summarizing Cluster Evolution in Dynamic Environments 2
  • 3. Irene Ntoutsi ICCSA’11  More and more data are produced nowadays:  Telcos, Banks, Health care systems, Retail industry, WWW …  Modern data are dynamic  A special category is data streams: possible infinite sequence of elements arriving at a rapid rate  Data Mining over such kind of data is even more challenging:  Huge amounts of data  only a small amount can be stored in memory  Arrival at a rapid rate  need for fast response time  The generative distribution of the stream might change over time  adapt and report on changes e1 … …e2 e3 e4 ene5 e6 e7 e8 e9 e10 Time Summarizing Cluster Evolution in Dynamic Environments 3
  • 4. Irene Ntoutsi ICCSA’11  Traditionally clustering is applied over static data  Lately there are approaches that deal with modern data Summarizing Cluster Evolution in Dynamic Environments Adapt clusters to reflect current state of the population. • CluStream [Aggrawal et al, VLDB’03] • DenStream [Cao et al, SDM’06] • Dstream [Chen and Tu, KDD’07] Trace changes and reason on them so as to gain insights on the population. • FOCUS [Ganti et al, PODS’99] • PANDA [Bartolini et al, KDE’09] • MONIC [Spiliopoulou et al, KDD’06] Dynamic Clustering Approaches Adaptation Tracing & Understanding 4
  • 5. Irene Ntoutsi ICCSA’11  Although there exist methods for:  online cluster adaptation as the stream proceeds and  change detection between clusterings extracted at different time points  they do not deal with the efficient long-term maintenance of the changes over an infinite stream of data  To this end, we propose: i. A graph representation of cluster changes/ transitions, and ii. methods for condensing this graph into a FINGERPRINT  The FINGERPRINT is a summary structure where similar clusters are efficiently summarized, subject to an information loss function. Summarizing Cluster Evolution in Dynamic Environments 5
  • 6. Irene Ntoutsi ICCSA’11  Motivation  The evolution graph  The FINGERPRINT of evolution  Experiments  Conclusions and outlook Summarizing Cluster Evolution in Dynamic Environments 6
  • 7. Irene Ntoutsi ICCSA’11  We consider a period of observation t1, t2, …, tm, …  New records arrive over time and old records are subject to ageing according to a window size parameter.  Under these settings, we create at each time point t the dataset Dt  At each t, we get a clustering Clt  Clt might be the result of i) complete reclustering at t or ii) cluster adaptation from Clt-1  Clustering evolution upon consecutive time points Clt-1, Clt is monitored Summarizing Cluster Evolution in Dynamic Environments … d1 t1 t2 t3 tm … Datasets d2 d3 dm ClusteringsCl2 Cl3 Clm…Cl1 The monitoring process (window size = 2) ? ? ? ? D2 D3 Dm 7
  • 8. Irene Ntoutsi ICCSA’11  Summarizing Cluster Evolution in Dynamic Environments C11 C21 C31 C12 C22 C23 C32 C41 C32 C13 C24 C32 C33 t_1 t_2 t_3 t_4 8
  • 9. Irene Ntoutsi ICCSA’11  Summarizing Cluster Evolution in Dynamic Environments 9
  • 10. Irene Ntoutsi ICCSA’11  Summarizing Cluster Evolution in Dynamic Environments 10
  • 11. Irene Ntoutsi ICCSA’11 (External) transitions of cluster X in clustering Cl1 towards Cl2:  survival The best match of X in Cl2 is not a match for any other cluster in Cl1.  absorption There is a Y in Cl2 that is a match for X AND for one more cluster in Cl1.  split There are Y[1],…,Y[p] in Cl2 that together match X AND the overlap of each one with X is at least τsplit.  disappearance X is not absorbed and not split and has not survived. AND  new cluster appearance Υ in Cl2 is not involved in the external transitions of any X. YX → ⊂ YX → ][],...,1[ pYYX →⊂ ⊗→X Y→⊗ Summarizing Cluster Evolution in Dynamic Environments 11
  • 12. Irene Ntoutsi ICCSA’11  EG is built incrementally as new clusterings arrive at t1, t2, …  Whenever a new clustering Cli arrives at ti:  Clusters of Cli are added as nodes to the EG and their labels are computed  We detect the cluster transitions w.r.t. Cli-1 and an edge is added to the EG for each detected transition between clusters in Cli-1 and Cli  Bookkeeping: The cluster members in Cli-1 are discarded, whereas the members of the clusters in Cli are retained till the next time point ti+1 ▪ We need these members to decide latter on the cluster transitions between Cli and Cli+1 Summarizing Cluster Evolution in Dynamic Environments 12
  • 13. Irene Ntoutsi ICCSA’11  Motivation  The evolution graph  The FINGERPRINT of evolution  Experiments  Conclusions and outlook Summarizing Cluster Evolution in Dynamic Environments 13
  • 14. Irene Ntoutsi ICCSA’11  We summarize EG so as cluster transitions are reflected but redundancies are omitted.  To this end, we summarize traces (sequences of cluster survivals) into some condensed form, the fingerprint of the trace.  For each emerged cluster c that appeared for the first time at t (i.e. a cluster with no incoming edges at t), we define its cluster trace as a sequence of cluster survivals: trace(c) = <c1, c2, …, cm>  First, we introduce the virtual center as the summary of a (sub)trace  Let trace(c) = <c1, c2, …, cm>. Let X = <cj, …, cj+k> be a subtrace of it. The virtual center of X is a derived node composed of the averages of the labels of the nodes in X: where [i] is the i-th dimension  To indicate that a cluster c has been mapped to a virtual center, we use the notation Summarizing Cluster Evolution in Dynamic Environments 14
  • 15. Irene Ntoutsi ICCSA’11  Summarizing Cluster Evolution in Dynamic Environments 15
  • 16. Irene Ntoutsi ICCSA’11  Information Loss  Space Reduction Summarizing Cluster Evolution in Dynamic Environments 16
  • 18. Irene Ntoutsi ICCSA’11  Once the traces are summarized into fingerprints, the evolution graph can be also summarized  We propose 2 summarization strategies:  Incremental summarization of the graph  Batch summarization of the graph Summarizing Cluster Evolution in Dynamic Environments 18
  • 19. Irene Ntoutsi ICCSA’11  The traces are summarized incrementally as new clusterings arrive over time  If a new clustering arrives, we check whether there is some cluster survival from the previous timepoint.  Let x be a cluster that survives into a latter cluster y.  If < τ, y is not added to the graph. Rather, x and y are summarized into a virtual center v and x is replaced in the graph by v.  Otherwise, the node y and the edge (x,y) are added to the graph Summarizing Cluster Evolution in Dynamic Environments 19
  • 20. Irene Ntoutsi ICCSA’11  The summarization is performed over the whole trace based on two heuristics:  Heuristic A (deals with the violation of C2): If T contains adjacent nodes that are in larger distance than τ from each other split T as follows: detect the pair (c1, c2) with the largest distance and split T into T1, T2 such that c1 is the last node of T1 and c2 is the first node of T2  Heuristic B (deals with the violation of C1): If T satisfies C2 but contains nodes that are in larger distance than τ from the virtual center vCenter(T), split T as follows: the node c that has the maximum distance to vCenter(T) is detected and T is partitioned into T1, T2 such that c is the last node of T1 and its successor is the first node of T2 Summarizing Cluster Evolution in Dynamic Environments 20
  • 21. Irene Ntoutsi ICCSA’11  Motivation  The evolution graph  The FINGERPRINT of evolution  Experiments  Conclusions and outlook Summarizing Cluster Evolution in Dynamic Environments 21
  • 22. Irene Ntoutsi ICCSA’11  We experiments with 3 datasets  The Network Intrusion dataset: contains TCP connection logs from 2 weeks of LAN network traffic ▪ Numerical dataset ▪ Rapidly evolving  The Charitable Donation dataset: contains information on people who have made charitable donations in response to direct mailings ▪ Numerical dataset ▪ Relatively stable  The ACM H2.8 dataset: the set of documents inserted in between 1997 and 2004 in the ACM Digital Library, category H2.8 on Database Applications ▪ Text dataset ▪ Evolves in an unbalanced way Summarizing Cluster Evolution in Dynamic Environments 22
  • 23. Irene Ntoutsi ICCSA’11  In 1998 we observe a new cluster on Information Systems which survives till 2000.  The batch FINGERPRINT, condenses this trace into a single virtual center in 1 step:  The incremental FINGERPRINT does the same in 2 steps:  First summarizes c1998_2 and c1999_6 into a virtual center v0  Then summarizes v0 and c2000_3 into a new virtual center Summarizing Cluster Evolution in Dynamic Environments 23
  • 24. Network intrusion dataset Charitable donation dataset Summarizing Cluster Evolution in Dynamic Environments 24 Impact of threshold τ on space reduction
  • 25. Network intrusion dataset Charitable donation dataset Summarizing Cluster Evolution in Dynamic Environments 25 Impact of threshold τ on information loss
  • 26. Irene Ntoutsi ICCSA’11  Motivation  The evolution graph  The FINGERPRINT of evolution  Experiments  Conclusions and outlook Summarizing Cluster Evolution in Dynamic Environments 26
  • 27. Irene Ntoutsi ICCSA’11  We presented the FINGERPRINT framework for summarizing cluster evolution in a dynamic environment subject to information loss and space reduction criteria  Batch FINGERPRINT has better quality but requires the whole graph as input. Some hybrid method might be interesting  So far we summarize only cluster survivals. What about splits and absorptions?  The impact of clustering quality on the summarization Summarizing Cluster Evolution in Dynamic Environments 27
  • 28. Irene Ntoutsi ICCSA’11Summarizing Cluster Evolution in Dynamic Environments Thank you for your attention! For further questions please contact me at: ntoutsi@dbs.ifi.lmu.de 28
  • 29. Irene Ntoutsi ICCSA’11 The speaker‘s attendance at this conference was sponsored by the Alexander von Humboldt Foundation http://www.humboldt-foundation.de Summarizing Cluster Evolution in Dynamic Environments 29