# Data Mining Lecture_8(b).pptx

Lecture 8b: Clustering Validity, Minimum Description Length (MDL), Introduction to Information Theory, Co-clustering using MDL. (ppt,pdf) Chapter 2, Evimaria Terzi, Problems and Algorithms for Sequence Segmentations, Ph.D. Thesis (PDF)

### Data Mining Lecture_8(b).pptx

• 1. DATA MINING LECTURE 8B: Time series analysis and Sequence Segmentation. Subrata Kumer Paul, Assistant Professor, Dept. of CSE, BAUET (sksubrata96@gmail.com)
• 2. Sequential data
  • Sequential data (or time series) refers to data that appear in a specific order.
  • The order defines a time axis, which differentiates this data from the other cases we have seen so far.
  • Examples:
    • The price of a stock (or of many stocks) over time
    • Environmental data (pressure, temperature, precipitation, etc.) over time
    • The sequence of queries in a search engine, or the frequency of a query over time
    • The words in a document as they appear in order
    • A DNA sequence of nucleotides
    • Event occurrences in a log over time
    • Etc.
  • Time series: usually we assume that we have a vector of numeric values that change over time.
• 3. Time-series data: financial time series, process monitoring…
• 4. Why deal with sequential data?
  • Because all data is sequential
    • All data items arrive in the data store in some order
  • In some (many) cases the order does not matter
    • E.g., we can assume a bag-of-words model for a document
  • In many cases the order is of interest
    • E.g., stock prices do not make sense without the time information
• 5. Time series analysis
  • The addition of the time axis defines new sets of problems:
    • Discovering periodic patterns in time series
    • Defining similarity between time series
    • Finding bursts, or outliers
  • Also, some existing problems need to be revisited taking sequential order into account:
    • Association rules and frequent itemsets in sequential data
    • Summarization and clustering: sequence segmentation
• 6. Sequence Segmentation
  • Goal: discover structure in the sequence and provide a concise summary
  • Given a sequence $T$, segment it into $K$ contiguous segments that are as homogeneous as possible
  • Similar to clustering, but now we require the points in a cluster to be contiguous
  • Commonly used for summarization of histograms in databases
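• 7. Example
  • [Figure: a sequence plotted as values in $\mathbb{R}$ against time $t$, shown before and after segmentation into 4 segments]
  • Homogeneity: points are close to the mean value of their segment (small error)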
• 8. Basic definitions
  • Sequence $T = \{t_1, t_2, \ldots, t_N\}$: an ordered set of $N$ $d$-dimensional real points $t_i \in \mathbb{R}^d$
  • A $K$-segmentation $S$: a partition of $T$ into $K$ contiguous segments $\{s_1, s_2, \ldots, s_K\}$
  • Each segment $s \in S$ is represented by a single vector $\mu_s \in \mathbb{R}^d$ (the representative of the segment, the same as the centroid of a cluster)
  • Error $E(S)$: the error of replacing individual points with their representatives
  • Different error functions define different representatives
  • Sum of Squares Error (SSE): $E(S) = \sum_{s \in S} \sum_{t \in s} \lVert t - \mu_s \rVert^2$
  • Representative of segment $s$ under SSE: the mean $\mu_s = \frac{1}{|s|} \sum_{t \in s} t$
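The SSE definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming one-dimensional points; the function names and the 0-based, half-open `boundaries` encoding `[0, b1, ..., N]` are choices made for this sketch and do not come from the slides.

```python
from statistics import fmean


def segment_sse(points):
    """Sum of squared deviations of 1-D points from their mean (the representative)."""
    mu = fmean(points)
    return sum((t - mu) ** 2 for t in points)


def segmentation_sse(sequence, boundaries):
    """E(S) for the segmentation whose segments are sequence[lo:hi].

    `boundaries` is a hypothetical encoding for this sketch: a strictly
    increasing list [0, b1, ..., N] of 0-based, half-open cut positions.
    """
    return sum(
        segment_sse(sequence[lo:hi])
        for lo, hi in zip(boundaries, boundaries[1:])
    )


T = [1.0, 1.0, 2.0, 8.0, 9.0, 10.0]
print(segmentation_sse(T, [0, 6]))     # one segment: large error
print(segmentation_sse(T, [0, 3, 6]))  # two homogeneous segments: small error
```

Splitting the example sequence between its low and high values lowers the error sharply, which is the homogeneity criterion the definitions describe.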
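• 9. Basic Definitions
  • Observation: a $K$-segmentation $S$ is defined by $K+1$ boundary points $b_0, b_1, \ldots, b_{K-1}, b_K$
  • $b_0 = 0$, $b_K = N + 1$ always
  • We only need to specify $b_1, \ldots, b_{K-1}$
  • [Figure: a sequence over time $t$ with boundary points $b_0, b_1, b_2, b_3, b_4$ marked]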
• 10. The K-segmentation problem
  • Similar to K-means clustering, but now we need the points in the clusters to respect the order of the sequence.
  • This actually makes the problem easier.
  • Problem statement: given a sequence $T$ of length $N$ and a value $K$, find a $K$-segmentation $S = \{s_1, s_2, \ldots, s_K\}$ of $T$ such that the SSE error $E$ is minimized.
• 11. Optimal solution for the K-segmentation problem
  • [Bellman '61]: the K-segmentation problem can be solved optimally using a standard dynamic-programming algorithm
  • Dynamic programming:
    • Construct the solution of the problem by using solutions to problems of smaller size
    • Define the dynamic programming recursion
    • Build the solution bottom up, from smaller to larger instances
    • Define the dynamic programming table that stores the solutions to the sub-problems
• 12. Rule of thumb
  • Most optimization problems where order is involved can be solved optimally in polynomial time using dynamic programming.
  • The polynomial exponent may be large, though.
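• 13. Dynamic Programming Recursion
  • Terminology:
    • $T[1,n]$: the subsequence $\{t_1, t_2, \ldots, t_n\}$ for $n \le N$
    • $E(S[1,n], k)$: the error of the optimal segmentation of the subsequence $T[1,n]$ with $k$ segments, for $k \le K$
  • Dynamic programming recursion: $E(S[1,n], k) = \min_{k-1 \le j \le n-1} \left\{ E(S[1,j], k-1) + \sum_{i=j+1}^{n} \lVert t_i - \mu_{[j+1,n]} \rVert^2 \right\}$
    • First term: the error of the optimal segmentation $S[1,j]$ with $k-1$ segments
    • Second term: the error of the $k$-th (last) segment when the last segment is $[j+1, n]$
    • The minimum is taken over all possible placements of the last boundary point $b_{k-1}$; $j$ starts at $k-1$ so that each of the first $k-1$ segments contains at least one point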
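• 14. Dynamic programming table
  • Two-dimensional table $A[1 \ldots K, 1 \ldots N]$ with $A[k,n] = E(S[1,n], k)$
  • Each cell is computed with the recursion $E(S[1,n], k) = \min_{k-1 \le j \le n-1} \left\{ E(S[1,j], k-1) + \sum_{i=j+1}^{n} \lVert t_i - \mu_{[j+1,n]} \rVert^2 \right\}$
  • Fill the table top to bottom, left to right
  • [Figure: the dynamic programming table with rows $1 \ldots K$ and columns $1 \ldots N$; cell $(k, n)$ holds $A[k,n]$, and a label marks the error of the optimal $K$-segmentation]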
• 15–18. Example ($k = 3$; four frames of the same figure)
  • Where should we place boundary $b_2$?
  • The recursion for $E(S[1,n], k)$ is evaluated over the candidate positions $j$ of $b_2$, each combined with $E(S[1,j], k-1)$
  • [Figure: the sequence in $\mathbb{R}$ up to the $n$-th point, with boundaries $b_1$ and $b_2$ marked, next to the dynamic programming table with rows 1 to 4 and columns $1 \ldots N$]
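• 19. Example ($k = 3$): optimal segmentation $S[1,n]$
  • The cell $A[3,n]$ stores the error of the optimal 3-segmentation of $T[1,n]$
  • In the cell (or in a different table) we also store the position $n-3$ of the boundary, so we can trace back the segmentation
  • [Figure: the sequence in $\mathbb{R}$ up to the $n$-th point with boundaries $b_1$ and $b_2$, next to the table with the cell for column $n$ and the stored position $n-3$ highlighted]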
• 20. Dynamic-programming algorithm
  • Input: sequence $T$, length $N$, $K$ segments, error function $E()$
  • For $i = 1$ to $N$ (initialize the first row):
    • $A[1,i] = E(T[1 \ldots i])$ (error when everything is in one cluster)
  • For $k = 1$ to $K$ (initialize the diagonal):
    • $A[k,k] = 0$ (error when each point is in its own cluster)
  • For $k = 2$ to $K$:
    • For $i = k+1$ to $N$:
      • $A[k,i] = \min_{k-1 \le j < i} \{ A[k-1,j] + E(T[j+1 \ldots i]) \}$
  • To recover the actual segmentation (not just the optimal cost), also store the minimizing values $j$
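The algorithm above can be sketched in Python as follows. This is an illustrative implementation, not the lecturer's code: it assumes one-dimensional points and the SSE error, and the names `k_segmentation`, `back`, and the half-open cut encoding `[0, b1, ..., N]` are introduced here. Prefix sums are used so that each segment error costs constant time, and the diagonal is filled by the same recursion instead of a separate initialization loop.

```python
def k_segmentation(seq, K):
    """Optimal K-segmentation of a 1-D sequence under SSE.

    Returns (error, cuts), where cuts is [0, b1, ..., N] in 0-based,
    half-open form (an encoding chosen for this sketch).
    """
    N = len(seq)
    if not 1 <= K <= N:
        raise ValueError("need 1 <= K <= N")

    # Prefix sums of values and of squares give any segment's SSE in O(1).
    s1 = [0.0] * (N + 1)
    s2 = [0.0] * (N + 1)
    for i, t in enumerate(seq, start=1):
        s1[i] = s1[i - 1] + t
        s2[i] = s2[i - 1] + t * t

    def sse(j, n):
        """SSE of the segment holding points j+1..n (1-based), with j < n."""
        total = s1[n] - s1[j]
        # max() guards against tiny negative values from rounding.
        return max(0.0, (s2[n] - s2[j]) - total * total / (n - j))

    inf = float("inf")
    # A[k][n]: error of the best k-segmentation of the first n points.
    A = [[inf] * (N + 1) for _ in range(K + 1)]
    # back[k][n]: the minimizing j, i.e. where the last segment starts.
    back = [[0] * (N + 1) for _ in range(K + 1)]

    for n in range(1, N + 1):
        A[1][n] = sse(0, n)
    for k in range(2, K + 1):
        for n in range(k, N + 1):
            for j in range(k - 1, n):
                cand = A[k - 1][j] + sse(j, n)
                if cand < A[k][n]:
                    A[k][n] = cand
                    back[k][n] = j

    # Trace the stored minimizers back from A[K][N].
    cuts = [N]
    n = N
    for k in range(K, 1, -1):
        n = back[k][n]
        cuts.append(n)
    cuts.append(0)
    return A[K][N], cuts[::-1]


error, cuts = k_segmentation([1.0, 1.0, 2.0, 8.0, 9.0, 10.0], 2)
print(error, cuts)  # cuts == [0, 3, 6]
```

Storing the minimizing `j` in a separate table keeps the cost table purely numeric; storing both in one cell, as the slides also allow, would work equally well.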
• 21. Algorithm Complexity
  • What is the complexity?
  • $NK$ cells to fill
  • Computation per cell: $E(S[1,n], k) = \min_{k-1 \le j < n} \left\{ E(S[1,j], k-1) + \sum_{i=j+1}^{n} \lVert t_i - \mu_{[j+1,n]} \rVert^2 \right\}$
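One way to answer the complexity question is sketched below, assuming one-dimensional points and the SSE error. The prefix sums $P_n$ and $Q_n$ are notation introduced for this derivation and do not appear in the slides. With them, the error of any candidate last segment is available in constant time, so each cell costs $O(N)$ and the whole table costs $O(N^2 K)$. Recomputing every segment error from scratch would instead cost $O(N)$ per candidate and $O(N^3 K)$ overall.

```latex
\begin{align*}
P_n &= \sum_{i=1}^{n} t_i, \qquad
Q_n = \sum_{i=1}^{n} t_i^2, \qquad
P_0 = Q_0 = 0 \\
\sum_{i=j+1}^{n} \left( t_i - \mu_{[j+1,n]} \right)^2
  &= \left( Q_n - Q_j \right) - \frac{\left( P_n - P_j \right)^2}{n - j} \\
\text{cost per cell}
  &= O(N) \text{ candidates } j \times O(1) \text{ per candidate} = O(N) \\
\text{total cost}
  &= NK \text{ cells} \times O(N) = O(N^2 K)
\end{align*}
```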
• 22. Heuristics
  • Top-down greedy (TD): $O(NK)$
    • Introduce boundaries one at a time, each time choosing the one that gives the largest decrease in error, until $K$ segments are created
  • Bottom-up greedy (BU): $O(N \log N)$
    • Start with every point in its own segment and repeatedly merge the two adjacent segments that cause the smallest increase in error, until $K$ segments remain
  • Local search heuristics: $O(NKI)$
    • Assign the breakpoints randomly and then move them so that the error is reduced
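As an illustration of the first heuristic, here is a minimal sketch of top-down greedy segmentation for one-dimensional points under SSE. The function names and the cut encoding `[0, ..., N]` are assumptions of this sketch. It recomputes segment errors naively for readability, so it does not reach the $O(NK)$ bound quoted above, and like any greedy method it may return a suboptimal segmentation.

```python
def sse(points):
    """Sum of squared deviations of 1-D points from their mean."""
    mu = sum(points) / len(points)
    return sum((t - mu) ** 2 for t in points)


def top_down_segmentation(seq, K):
    """Greedy top-down K-segmentation; returns sorted cuts [0, ..., N]."""
    N = len(seq)
    if not 1 <= K <= N:
        raise ValueError("need 1 <= K <= N")
    cuts = [0, N]  # start with a single segment covering everything
    while len(cuts) - 1 < K:
        best_gain, best_cut = -1.0, None
        # Try every possible new boundary inside every current segment.
        for lo, hi in zip(cuts, cuts[1:]):
            whole = sse(seq[lo:hi])
            for c in range(lo + 1, hi):
                gain = whole - sse(seq[lo:c]) - sse(seq[c:hi])
                if gain > best_gain:
                    best_gain, best_cut = gain, c
        # Keep the boundary that decreases the total error the most.
        cuts.append(best_cut)
        cuts.sort()
    return cuts


print(top_down_segmentation([1.0, 1.0, 2.0, 8.0, 9.0, 10.0], 2))
```

Because a boundary is never removed once placed, an early split that looks good can block a better combination later; that is the trade-off against the exact dynamic program.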
• 23. Other time series analysis
  • Using signal processing techniques is common for defining similarity between series:
    • Fast Fourier Transform
    • Wavelets
  • Rich literature in the field
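One common way to turn a Fourier transform into a similarity measure is to compare only the first few frequency coefficients of two equal-length series. The sketch below is a hedged illustration of that idea, not a method taken from the slides: it uses a naive discrete Fourier transform from the standard library rather than a Fast Fourier Transform, and the names `dft_prefix` and `dft_distance` are hypothetical.

```python
import cmath
import math


def dft_prefix(series, m):
    """First m DFT coefficients, scaled by 1/sqrt(N) (unitary convention)."""
    N = len(series)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * f * n / N)
            for n, x in enumerate(series)) / math.sqrt(N)
        for f in range(m)
    ]


def dft_distance(a, b, m):
    """Euclidean distance between the first m DFT coefficients of a and b."""
    if len(a) != len(b):
        raise ValueError("series must have the same length")
    coeffs_a = dft_prefix(a, m)
    coeffs_b = dft_prefix(b, m)
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(coeffs_a, coeffs_b)))


a = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
b = [1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
print(dft_distance(a, b, 3))  # distance using 3 coefficients
print(dft_distance(a, b, 8))  # all coefficients: matches the time-domain distance
```

With the unitary scaling, Parseval's theorem makes the full-coefficient distance equal to the ordinary Euclidean distance between the series, so a truncated distance never overestimates it.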